Netdev List
 help / color / mirror / Atom feed
From: Tariq Toukan <tariqt@nvidia.com>
To: Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, <netdev@vger.kernel.org>,
	Paolo Abeni <pabeni@redhat.com>
Cc: Adithya Jayachandran <ajayachandra@nvidia.com>,
	Bobby Eshleman <bobbyeshleman@meta.com>,
	Carolina Jubran <cjubran@nvidia.com>,
	Cosmin Ratiu <cratiu@nvidia.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Daniel Jurgens <danielj@nvidia.com>,
	Daniel Zahka <daniel.zahka@gmail.com>, David Wei <dw@davidwei.uk>,
	Donald Hunter <donald.hunter@gmail.com>,
	Dragos Tatulea <dtatulea@nvidia.com>,
	Jiri Pirko <jiri@nvidia.com>, Jiri Pirko <jiri@resnulli.us>,
	Joe Damato <joe@dama.to>, Jonathan Corbet <corbet@lwn.net>,
	Kees Cook <kees@kernel.org>, Leon Romanovsky <leon@kernel.org>,
	<linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-kselftest@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
	Mark Bloch <mbloch@nvidia.com>, Moshe Shemesh <moshe@nvidia.com>,
	Or Har-Toov <ohartoov@nvidia.com>,
	Parav Pandit <parav@nvidia.com>, Petr Machata <petrm@nvidia.com>,
	Ratheesh Kannoth <rkannoth@marvell.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Shahar Shitrit <shshitrit@nvidia.com>,
	Shay Drori <shayd@nvidia.com>, Shuah Khan <shuah@kernel.org>,
	Shuah Khan <skhan@linuxfoundation.org>,
	Simon Horman <horms@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Tariq Toukan <tariqt@nvidia.com>,
	Willem de Bruijn <willemb@google.com>,
	Gal Pressman <gal@nvidia.com>
Subject: [PATCH net-next V10 11/14] net/mlx5: qos: Remove qos domains and use shd
Date: Wed, 1 Jul 2026 10:32:51 +0300	[thread overview]
Message-ID: <20260701073254.754518-12-tariqt@nvidia.com> (raw)
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

E-Switch QoS domains were added with the intention of eventually
implementing shared qos domains to support cross-esw scheduling in the
previous approach ([1]), but they are no longer necessary in the new
approach.

Remove QoS domains and switch to using the shd lock for protecting
against concurrent QoS modifications.
Enable the supported_cross_device_rate_nodes devlink ops attribute so
that all calls originating from devlink rate acquire the shd lock. Only
the additional entry points into QoS need to acquire the shd lock.

The wrinkle is that since shd can be NULL (e.g. on older HW without
serial number available), there needs to be a fallback locking
mechanism. The devlink instance lock cannot be used, as some code paths
into QoS (get, set & modify vport rate) happen with RTNL held, and the
existing devlink -> RTNL order prevents devlink lock usage there.

The other two options are either esw->state_lock or a new lock as
fallback when shd is NULL. This patch adds esw->state_lock, which
implies:

- 3 new lock/unlock helper pairs to acquire/release the missing lock:
  - esw_qos_{,un}lock: acquire/release esw->state_lock when shd is NULL.
  - esw_qos_shd_{,un}lock: when esw->state_lock is already held.
  - esw_qos_devlink_{,un}lock: when shd is already held.
- esw_assert_qos_lock_held now asserts esw->state_lock is held when shd
  is NULL.

Use the corresponding lock/unlock function in all places where either
shd or state_lock would need to be acquired.

Document all of this trickery next to esw_assert_qos_lock_held.

Enabling supported_cross_device_rate_nodes now is safe, because
mlx5_esw_qos_vport_update_parent rejects cross-esw parent updates.
This will change in the next patch.

[1]
https://lore.kernel.org/netdev/20250213180134.323929-1-tariqt@nvidia.com/

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   1 +
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 245 ++++++++----------
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |   3 -
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |   8 -
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  13 +-
 5 files changed, 120 insertions(+), 150 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index c31e05529fc4..b9026cc64383 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -383,6 +383,7 @@ static const struct devlink_ops mlx5_devlink_ops = {
 	.rate_node_del = mlx5_esw_devlink_rate_node_del,
 	.rate_leaf_parent_set = mlx5_esw_devlink_rate_leaf_parent_set,
 	.rate_node_parent_set = mlx5_esw_devlink_rate_node_parent_set,
+	.supported_cross_device_rate_nodes = true,
 #endif
 #ifdef CONFIG_MLX5_SF_MANAGER
 	.port_new = mlx5_devlink_sf_port_new,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 49c8ec0dac9a..80a28596349b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -11,53 +11,6 @@
 /* Minimum supported BW share value by the HW is 1 Mbit/sec */
 #define MLX5_MIN_BW_SHARE 1
 
-/* Holds rate nodes associated with an E-Switch. */
-struct mlx5_qos_domain {
-	/* Serializes access to all qos changes in the qos domain. */
-	struct mutex lock;
-};
-
-static void esw_qos_lock(struct mlx5_eswitch *esw)
-{
-	mutex_lock(&esw->qos.domain->lock);
-}
-
-static void esw_qos_unlock(struct mlx5_eswitch *esw)
-{
-	mutex_unlock(&esw->qos.domain->lock);
-}
-
-static void esw_assert_qos_lock_held(struct mlx5_eswitch *esw)
-{
-	lockdep_assert_held(&esw->qos.domain->lock);
-}
-
-static struct mlx5_qos_domain *esw_qos_domain_alloc(void)
-{
-	struct mlx5_qos_domain *qos_domain;
-
-	qos_domain = kzalloc_obj(*qos_domain);
-	if (!qos_domain)
-		return NULL;
-
-	mutex_init(&qos_domain->lock);
-
-	return qos_domain;
-}
-
-static int esw_qos_domain_init(struct mlx5_eswitch *esw)
-{
-	esw->qos.domain = esw_qos_domain_alloc();
-
-	return esw->qos.domain ? 0 : -ENOMEM;
-}
-
-static void esw_qos_domain_release(struct mlx5_eswitch *esw)
-{
-	kfree(esw->qos.domain);
-	esw->qos.domain = NULL;
-}
-
 enum sched_node_type {
 	SCHED_NODE_TYPE_ROOT,
 	SCHED_NODE_TYPE_VPORTS_TSAR,
@@ -104,6 +57,65 @@ struct mlx5_esw_sched_node {
 	u32 tc_bw[DEVLINK_RATE_TCS_MAX];
 };
 
+/* Locking notes:
+ * QoS changes are normally protected by the shd lock. But on older HW shd
+ * might not be created at all, so there needs to be a fallback serialization
+ * mechanism. This is esw->state_lock.
+ * Callers into QoS hold a combination of RTNL, devlink instance lock and
+ * esw->state_lock. Devlink rate ops additionally hold the shd lock if it
+ * exists.
+ * - VF rate ops use esw_qos_lock/esw_qos_unlock.
+ * - callers with esw->state_lock held use esw_qos_shd_lock/esw_qos_shd_unlock.
+ * - devlink callers use esw_qos_devlink_lock/esw_qos_devlink_unlock.
+ */
+static void esw_assert_qos_lock_held(struct mlx5_core_dev *dev)
+{
+	if (dev->shd)
+		devl_assert_locked(dev->shd);
+	else
+		lockdep_assert_held(&dev->priv.eswitch->state_lock);
+}
+
+static void esw_qos_lock(struct mlx5_core_dev *dev)
+{
+	if (dev->shd)
+		devl_lock(dev->shd);
+	else
+		mutex_lock(&dev->priv.eswitch->state_lock);
+}
+
+static void esw_qos_unlock(struct mlx5_core_dev *dev)
+{
+	if (dev->shd)
+		devl_unlock(dev->shd);
+	else
+		mutex_unlock(&dev->priv.eswitch->state_lock);
+}
+
+static void esw_qos_shd_lock(struct mlx5_core_dev *dev)
+{
+	if (dev->shd)
+		devl_lock(dev->shd);
+}
+
+static void esw_qos_shd_unlock(struct mlx5_core_dev *dev)
+{
+	if (dev->shd)
+		devl_unlock(dev->shd);
+}
+
+static void esw_qos_devlink_lock(struct mlx5_core_dev *dev)
+{
+	if (!dev->shd)
+		mutex_lock(&dev->priv.eswitch->state_lock);
+}
+
+static void esw_qos_devlink_unlock(struct mlx5_core_dev *dev)
+{
+	if (!dev->shd)
+		mutex_unlock(&dev->priv.eswitch->state_lock);
+}
+
 static int esw_qos_num_tcs(struct mlx5_core_dev *dev)
 {
 	int num_tcs = mlx5_max_tc(dev) + 1;
@@ -700,7 +712,7 @@ esw_qos_create_vports_sched_node(struct mlx5_eswitch *esw, struct netlink_ext_ac
 	struct mlx5_esw_sched_node *node;
 	int err;
 
-	esw_assert_qos_lock_held(esw);
+	esw_assert_qos_lock_held(esw->dev);
 	if (!MLX5_CAP_QOS(esw->dev, log_esw_max_sched_depth))
 		return ERR_PTR(-EOPNOTSUPP);
 
@@ -771,7 +783,7 @@ static int esw_qos_get(struct mlx5_eswitch *esw, struct netlink_ext_ack *extack)
 {
 	int err = 0;
 
-	esw_assert_qos_lock_held(esw);
+	esw_assert_qos_lock_held(esw->dev);
 	if (!refcount_inc_not_zero(&esw->qos.refcnt)) {
 		/* esw_qos_create() set refcount to 1 only on success.
 		 * No need to decrement on failure.
@@ -784,7 +796,7 @@ static int esw_qos_get(struct mlx5_eswitch *esw, struct netlink_ext_ack *extack)
 
 static void esw_qos_put(struct mlx5_eswitch *esw)
 {
-	esw_assert_qos_lock_held(esw);
+	esw_assert_qos_lock_held(esw->dev);
 	if (refcount_dec_and_test(&esw->qos.refcnt))
 		esw_qos_destroy(esw);
 }
@@ -940,7 +952,7 @@ esw_qos_vport_tc_enable(struct mlx5_vport *vport, enum sched_node_type type,
 		}
 	}
 
-	esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+	esw_assert_qos_lock_held(vport->dev);
 
 	if (type == SCHED_NODE_TYPE_RATE_LIMITER)
 		err = esw_qos_create_rate_limit_element(vport_node, extack);
@@ -1038,7 +1050,7 @@ static int esw_qos_vport_enable(struct mlx5_vport *vport,
 	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
 	int err;
 
-	esw_assert_qos_lock_held(vport_node->esw);
+	esw_assert_qos_lock_held(vport->dev);
 
 	esw_qos_node_set_parent(vport_node, parent);
 	if (type == SCHED_NODE_TYPE_VPORT)
@@ -1064,7 +1076,7 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
 	struct mlx5_esw_sched_node *sched_node;
 	int err;
 
-	esw_assert_qos_lock_held(esw);
+	esw_assert_qos_lock_held(vport->dev);
 	err = esw_qos_get(esw, extack);
 	if (err)
 		return err;
@@ -1093,15 +1105,13 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
 
 static void mlx5_esw_qos_vport_disable_locked(struct mlx5_vport *vport)
 {
-	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
-
-	esw_assert_qos_lock_held(esw);
+	esw_assert_qos_lock_held(vport->dev);
 	if (!vport->qos.sched_node)
 		return;
 
 	esw_qos_vport_disable(vport, NULL);
 	mlx5_esw_qos_vport_qos_free(vport);
-	esw_qos_put(esw);
+	esw_qos_put(vport->dev->priv.eswitch);
 }
 
 void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
@@ -1109,9 +1119,9 @@ void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
 	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
 
 	lockdep_assert_held(&esw->state_lock);
-	esw_qos_lock(esw);
+	esw_qos_shd_lock(vport->dev);
 	mlx5_esw_qos_vport_disable_locked(vport);
-	esw_qos_unlock(esw);
+	esw_qos_shd_unlock(vport->dev);
 }
 
 static int mlx5_esw_qos_set_vport_max_rate(struct mlx5_vport *vport, u32 max_rate,
@@ -1119,7 +1129,7 @@ static int mlx5_esw_qos_set_vport_max_rate(struct mlx5_vport *vport, u32 max_rat
 {
 	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
 
-	esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+	esw_assert_qos_lock_held(vport->dev);
 
 	if (!vport_node)
 		return mlx5_esw_qos_vport_enable(vport, SCHED_NODE_TYPE_VPORT, NULL, max_rate, 0,
@@ -1134,7 +1144,7 @@ static int mlx5_esw_qos_set_vport_min_rate(struct mlx5_vport *vport, u32 min_rat
 {
 	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
 
-	esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+	esw_assert_qos_lock_held(vport->dev);
 
 	if (!vport_node)
 		return mlx5_esw_qos_vport_enable(vport, SCHED_NODE_TYPE_VPORT, NULL, 0, min_rate,
@@ -1147,29 +1157,27 @@ static int mlx5_esw_qos_set_vport_min_rate(struct mlx5_vport *vport, u32 min_rat
 
 int mlx5_esw_qos_set_vport_rate(struct mlx5_vport *vport, u32 max_rate, u32 min_rate)
 {
-	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
 	int err;
 
-	esw_qos_lock(esw);
+	esw_qos_lock(vport->dev);
 	err = mlx5_esw_qos_set_vport_min_rate(vport, min_rate, NULL);
 	if (!err)
 		err = mlx5_esw_qos_set_vport_max_rate(vport, max_rate, NULL);
-	esw_qos_unlock(esw);
+	esw_qos_unlock(vport->dev);
 	return err;
 }
 
 bool mlx5_esw_qos_get_vport_rate(struct mlx5_vport *vport, u32 *max_rate, u32 *min_rate)
 {
-	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
 	bool enabled;
 
-	esw_qos_lock(esw);
+	esw_qos_shd_lock(vport->dev);
 	enabled = !!vport->qos.sched_node;
 	if (enabled) {
 		*max_rate = vport->qos.sched_node->max_rate;
 		*min_rate = vport->qos.sched_node->min_rate;
 	}
-	esw_qos_unlock(esw);
+	esw_qos_shd_unlock(vport->dev);
 	return enabled;
 }
 
@@ -1205,7 +1213,7 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
 	u32 curr_tc_bw[DEVLINK_RATE_TCS_MAX] = {0};
 	int err;
 
-	esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+	esw_assert_qos_lock_held(vport->dev);
 	if (curr_type == type && curr_parent == parent)
 		return 0;
 
@@ -1235,11 +1243,10 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
 static int esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw_sched_node *parent,
 				       struct netlink_ext_ack *extack)
 {
-	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
 	struct mlx5_esw_sched_node *curr_parent;
 	enum sched_node_type type;
 
-	esw_assert_qos_lock_held(esw);
+	esw_assert_qos_lock_held(vport->dev);
 	curr_parent = vport->qos.sched_node->parent;
 	if (curr_parent == parent)
 		return 0;
@@ -1503,9 +1510,9 @@ int mlx5_esw_qos_modify_vport_rate(struct mlx5_eswitch *esw, u16 vport_num, u32
 			return err;
 	}
 
-	esw_qos_lock(esw);
+	esw_qos_lock(vport->dev);
 	err = mlx5_esw_qos_set_vport_max_rate(vport, rate_mbps, NULL);
-	esw_qos_unlock(esw);
+	esw_qos_unlock(vport->dev);
 
 	return err;
 }
@@ -1582,7 +1589,7 @@ static void esw_vport_qos_prune_empty(struct mlx5_vport *vport)
 {
 	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
 
-	esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+	esw_assert_qos_lock_held(vport->dev);
 	if (!vport_node)
 		return;
 
@@ -1594,44 +1601,26 @@ static void esw_vport_qos_prune_empty(struct mlx5_vport *vport)
 	mlx5_esw_qos_vport_disable_locked(vport);
 }
 
-int mlx5_esw_qos_init(struct mlx5_eswitch *esw)
-{
-	if (esw->qos.domain)
-		return 0;  /* Nothing to change. */
-
-	return esw_qos_domain_init(esw);
-}
-
-void mlx5_esw_qos_cleanup(struct mlx5_eswitch *esw)
-{
-	if (esw->qos.domain)
-		esw_qos_domain_release(esw);
-}
-
 /* Eswitch devlink rate API */
 
 int mlx5_esw_devlink_rate_leaf_tx_share_set(struct devlink_rate *rate_leaf, void *priv,
 					    u64 tx_share, struct netlink_ext_ack *extack)
 {
 	struct mlx5_vport *vport = priv;
-	struct mlx5_eswitch *esw;
 	int err;
 
-	esw = vport->dev->priv.eswitch;
-	if (!mlx5_esw_allowed(esw))
+	if (!mlx5_esw_allowed(vport->dev->priv.eswitch))
 		return -EPERM;
 
 	err = esw_qos_devlink_rate_to_mbps(vport->dev, "tx_share", &tx_share, extack);
 	if (err)
 		return err;
 
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(vport->dev);
 	err = mlx5_esw_qos_set_vport_min_rate(vport, tx_share, extack);
-	if (err)
-		goto out;
-	esw_vport_qos_prune_empty(vport);
-out:
-	esw_qos_unlock(esw);
+	if (!err)
+		esw_vport_qos_prune_empty(vport);
+	esw_qos_devlink_unlock(vport->dev);
 	return err;
 }
 
@@ -1639,24 +1628,20 @@ int mlx5_esw_devlink_rate_leaf_tx_max_set(struct devlink_rate *rate_leaf, void *
 					  u64 tx_max, struct netlink_ext_ack *extack)
 {
 	struct mlx5_vport *vport = priv;
-	struct mlx5_eswitch *esw;
 	int err;
 
-	esw = vport->dev->priv.eswitch;
-	if (!mlx5_esw_allowed(esw))
+	if (!mlx5_esw_allowed(vport->dev->priv.eswitch))
 		return -EPERM;
 
 	err = esw_qos_devlink_rate_to_mbps(vport->dev, "tx_max", &tx_max, extack);
 	if (err)
 		return err;
 
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(vport->dev);
 	err = mlx5_esw_qos_set_vport_max_rate(vport, tx_max, extack);
-	if (err)
-		goto out;
-	esw_vport_qos_prune_empty(vport);
-out:
-	esw_qos_unlock(esw);
+	if (!err)
+		esw_vport_qos_prune_empty(vport);
+	esw_qos_devlink_unlock(vport->dev);
 	return err;
 }
 
@@ -1667,16 +1652,14 @@ int mlx5_esw_devlink_rate_leaf_tc_bw_set(struct devlink_rate *rate_leaf,
 {
 	struct mlx5_esw_sched_node *vport_node;
 	struct mlx5_vport *vport = priv;
-	struct mlx5_eswitch *esw;
 	bool disable;
 	int err = 0;
 
-	esw = vport->dev->priv.eswitch;
-	if (!mlx5_esw_allowed(esw))
+	if (!mlx5_esw_allowed(vport->dev->priv.eswitch))
 		return -EPERM;
 
 	disable = esw_qos_tc_bw_disabled(tc_bw);
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(vport->dev);
 
 	if (!esw_qos_vport_validate_unsupported_tc_bw(vport, tc_bw)) {
 		NL_SET_ERR_MSG_MOD(extack,
@@ -1710,7 +1693,7 @@ int mlx5_esw_devlink_rate_leaf_tc_bw_set(struct devlink_rate *rate_leaf,
 	if (!err)
 		esw_qos_set_tc_arbiter_bw_shares(vport_node, tc_bw, extack);
 unlock:
-	esw_qos_unlock(esw);
+	esw_qos_devlink_unlock(vport->dev);
 	return err;
 }
 
@@ -1720,18 +1703,17 @@ int mlx5_esw_devlink_rate_node_tc_bw_set(struct devlink_rate *rate_node,
 					 struct netlink_ext_ack *extack)
 {
 	struct mlx5_esw_sched_node *node = priv;
-	struct mlx5_eswitch *esw = node->esw;
 	bool disable;
 	int err;
 
-	if (!esw_qos_validate_unsupported_tc_bw(esw, tc_bw)) {
+	if (!esw_qos_validate_unsupported_tc_bw(node->esw, tc_bw)) {
 		NL_SET_ERR_MSG_MOD(extack,
 				   "E-Switch traffic classes number is not supported");
 		return -EOPNOTSUPP;
 	}
 
 	disable = esw_qos_tc_bw_disabled(tc_bw);
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(node->esw->dev);
 	if (disable) {
 		err = esw_qos_node_disable_tc_arbitration(node, extack);
 		goto unlock;
@@ -1741,7 +1723,7 @@ int mlx5_esw_devlink_rate_node_tc_bw_set(struct devlink_rate *rate_node,
 	if (!err)
 		esw_qos_set_tc_arbiter_bw_shares(node, tc_bw, extack);
 unlock:
-	esw_qos_unlock(esw);
+	esw_qos_devlink_unlock(node->esw->dev);
 	return err;
 }
 
@@ -1756,9 +1738,9 @@ int mlx5_esw_devlink_rate_node_tx_share_set(struct devlink_rate *rate_node, void
 	if (err)
 		return err;
 
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(esw->dev);
 	err = esw_qos_set_node_min_rate(node, tx_share, extack);
-	esw_qos_unlock(esw);
+	esw_qos_devlink_unlock(esw->dev);
 	return err;
 }
 
@@ -1773,9 +1755,9 @@ int mlx5_esw_devlink_rate_node_tx_max_set(struct devlink_rate *rate_node, void *
 	if (err)
 		return err;
 
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(esw->dev);
 	err = esw_qos_sched_elem_config(node, tx_max, node->bw_share, extack);
-	esw_qos_unlock(esw);
+	esw_qos_devlink_unlock(esw->dev);
 	return err;
 }
 
@@ -1790,7 +1772,7 @@ int mlx5_esw_devlink_rate_node_new(struct devlink_rate *rate_node, void **priv,
 	if (IS_ERR(esw))
 		return PTR_ERR(esw);
 
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(esw->dev);
 	if (esw->mode != MLX5_ESWITCH_OFFLOADS) {
 		NL_SET_ERR_MSG_MOD(extack,
 				   "Rate node creation supported only in switchdev mode");
@@ -1803,10 +1785,9 @@ int mlx5_esw_devlink_rate_node_new(struct devlink_rate *rate_node, void **priv,
 		err = PTR_ERR(node);
 		goto unlock;
 	}
-
 	*priv = node;
 unlock:
-	esw_qos_unlock(esw);
+	esw_qos_devlink_unlock(esw->dev);
 	return err;
 }
 
@@ -1816,10 +1797,11 @@ int mlx5_esw_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
 	struct mlx5_esw_sched_node *node = priv;
 	struct mlx5_eswitch *esw = node->esw;
 
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(esw->dev);
 	__esw_qos_destroy_node(node, extack);
 	esw_qos_put(esw);
-	esw_qos_unlock(esw);
+	esw_qos_devlink_unlock(esw->dev);
+
 	return 0;
 }
 
@@ -1836,7 +1818,6 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
 		return -EOPNOTSUPP;
 	}
 
-	esw_qos_lock(esw);
 	if (!vport->qos.sched_node && parent) {
 		enum sched_node_type type;
 
@@ -1849,7 +1830,7 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
 						  parent ? : esw->qos.root,
 						  extack);
 	}
-	esw_qos_unlock(esw);
+
 	return err;
 }
 
@@ -1862,14 +1843,11 @@ int mlx5_esw_devlink_rate_leaf_parent_set(struct devlink_rate *devlink_rate,
 	struct mlx5_vport *vport = priv;
 	int err;
 
+	esw_qos_devlink_lock(vport->dev);
 	err = mlx5_esw_qos_vport_update_parent(vport, node, extack);
-	if (!err) {
-		struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
-
-		esw_qos_lock(esw);
+	if (!err)
 		esw_vport_qos_prune_empty(vport);
-		esw_qos_unlock(esw);
-	}
+	esw_qos_devlink_unlock(vport->dev);
 
 	return err;
 }
@@ -1996,7 +1974,7 @@ static int mlx5_esw_qos_node_update_parent(struct mlx5_esw_sched_node *node,
 	struct mlx5_eswitch *esw = node->esw;
 	int err;
 
-	esw_qos_lock(esw);
+	esw_qos_devlink_lock(esw->dev);
 	curr_parent = node->parent;
 	if (!parent)
 		parent = esw->qos.root;
@@ -2019,8 +1997,7 @@ static int mlx5_esw_qos_node_update_parent(struct mlx5_esw_sched_node *node,
 	esw_qos_normalize_min_rate(parent, extack);
 
 out:
-	esw_qos_unlock(esw);
-
+	esw_qos_devlink_unlock(esw->dev);
 	return err;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
index 0a50982b0e27..f275e850d2c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
@@ -6,9 +6,6 @@
 
 #ifdef CONFIG_MLX5_ESWITCH
 
-int mlx5_esw_qos_init(struct mlx5_eswitch *esw);
-void mlx5_esw_qos_cleanup(struct mlx5_eswitch *esw);
-
 int mlx5_esw_qos_set_vport_rate(struct mlx5_vport *evport, u32 max_rate, u32 min_rate);
 bool mlx5_esw_qos_get_vport_rate(struct mlx5_vport *vport, u32 *max_rate, u32 *min_rate);
 void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index b67f15a8f766..b6e2c153b4f7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1885,10 +1885,6 @@ int mlx5_eswitch_enable_locked(struct mlx5_eswitch *esw, int num_vfs)
 	MLX5_NB_INIT(&esw->nb, eswitch_vport_event, NIC_VPORT_CHANGE);
 	mlx5_eq_notifier_register(esw->dev, &esw->nb);
 
-	err = mlx5_esw_qos_init(esw);
-	if (err)
-		goto err_esw_init;
-
 	if (esw->mode == MLX5_ESWITCH_LEGACY) {
 		err = esw_legacy_enable(esw);
 	} else {
@@ -2555,9 +2551,6 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
 		goto reps_err;
 
 	esw->mode = MLX5_ESWITCH_LEGACY;
-	err = mlx5_esw_qos_init(esw);
-	if (err)
-		goto reps_err;
 
 	mutex_init(&esw->offloads.encap_tbl_lock);
 	hash_init(esw->offloads.encap_tbl);
@@ -2612,7 +2605,6 @@ void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw)
 
 	mlx5_eswitch_invalidate_wq(esw);
 	destroy_workqueue(esw->work_queue);
-	mlx5_esw_qos_cleanup(esw);
 	WARN_ON(refcount_read(&esw->qos.refcnt));
 	mutex_destroy(&esw->state_lock);
 	WARN_ON(!xa_empty(&esw->offloads.vhca_map));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 10c4eacd43b4..c655f6e8da1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -234,8 +234,10 @@ struct mlx5_vport {
 
 	struct mlx5_vport_info  info;
 
-	/* Protected with the E-Switch qos domain lock. The Vport QoS can
-	 * either be disabled (sched_node is NULL) or in one of three states:
+	/* Protected by either the shared devlink (dev->shd) lock or by
+	 * esw->state_lock. See esw_assert_qos_lock_held() for more details.
+	 * The Vport QoS can either be disabled (sched_node is NULL) or in one
+	 * of three states:
 	 * 1. Regular QoS (sched_node is a vport node).
 	 * 2. TC QoS enabled on the vport (sched_node is a TC arbiter).
 	 * 3. TC QoS enabled on the vport's parent node
@@ -382,7 +384,6 @@ enum {
 };
 
 struct dentry;
-struct mlx5_qos_domain;
 
 struct mlx5_eswitch {
 	struct mlx5_core_dev    *dev;
@@ -411,11 +412,13 @@ struct mlx5_eswitch {
 	atomic64_t user_count;
 	wait_queue_head_t work_queue_wait;
 
-	/* Protected with the E-Switch qos domain lock. */
+	/* QoS changes are serialized by either the shared devlink (dev->shd)
+	 * lock or by esw->state_lock. See esw_assert_qos_lock_held() for more
+	 * details.
+	 */
 	struct {
 		/* Initially 0, meaning no QoS users and QoS is disabled. */
 		refcount_t refcnt;
-		struct mlx5_qos_domain *domain;
 		/* The root node of the hierarchy. */
 		struct mlx5_esw_sched_node *root;
 	} qos;
-- 
2.44.0


  parent reply	other threads:[~2026-07-01  7:36 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-07-01  7:32 [PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 01/14] devlink: Update nested instance locking comment Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 02/14] devlink: Add a helper for getting a nested-in instance Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 03/14] devlink: Migrate from info->user_ptr to info->ctx Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 04/14] devlink: Decouple rate storage from associated devlink object Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 05/14] devlink: Add parent dev to devlink API Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 06/14] devlink: Allow parent dev for rate-set and rate-new Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 07/14] devlink: Allow rate node parents from other devlinks Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 08/14] net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 09/14] net/mlx5: qos: Refactor vport QoS cleanup Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 10/14] net/mlx5: qos: Model the root node in the scheduling hierarchy Tariq Toukan
2026-07-01  7:32 ` Tariq Toukan [this message]
2026-07-01  7:32 ` [PATCH net-next V10 12/14] net/mlx5: qos: Support cross-device tx scheduling Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 13/14] selftests: drv-net: Add test for cross-esw rate scheduling Tariq Toukan
2026-07-01  7:32 ` [PATCH net-next V10 14/14] net/mlx5: Document devlink rates Tariq Toukan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260701073254.754518-12-tariqt@nvidia.com \
    --to=tariqt@nvidia.com \
    --cc=ajayachandra@nvidia.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=bobbyeshleman@meta.com \
    --cc=cjubran@nvidia.com \
    --cc=corbet@lwn.net \
    --cc=cratiu@nvidia.com \
    --cc=daniel.zahka@gmail.com \
    --cc=daniel@iogearbox.net \
    --cc=danielj@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=donald.hunter@gmail.com \
    --cc=dtatulea@nvidia.com \
    --cc=dw@davidwei.uk \
    --cc=edumazet@google.com \
    --cc=gal@nvidia.com \
    --cc=horms@kernel.org \
    --cc=jiri@nvidia.com \
    --cc=jiri@resnulli.us \
    --cc=joe@dama.to \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=moshe@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=ohartoov@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=parav@nvidia.com \
    --cc=petrm@nvidia.com \
    --cc=rkannoth@marvell.com \
    --cc=saeedm@nvidia.com \
    --cc=sdf@fomichev.me \
    --cc=shayd@nvidia.com \
    --cc=shshitrit@nvidia.com \
    --cc=shuah@kernel.org \
    --cc=skhan@linuxfoundation.org \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox