public inbox for netdev@vger.kernel.org
* [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20
@ 2025-08-20 13:32 Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 1/8] net/mlx5: Base ECVF devlink port attrs from 0 Mark Bloch
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Mark Bloch

Hi,

This patchset provides miscellaneous bug fixes from the team for the
mlx5 core and Ethernet drivers.

v1: https://lore.kernel.org/all/1755095476-414026-1-git-send-email-tariqt@nvidia.com/

Changelog:

v1->v2:
- Addressed comments made by Przemek. The order of Carolina's
  patches was adjusted to fix the issues raised, and one additional
  patch was added.

- Added a fix by Armen at the end of the patchset.

Alexei Lazar (1):
  net/mlx5e: Query FW for buffer ownership

Armen Ratner (1):
  net/mlx5e: Preserve shared buffer capacity during headroom updates

Carolina Jubran (5):
  net/mlx5: Remove default QoS group and attach vports directly to root TSAR
  net/mlx5e: Preserve tc-bw during parent changes
  net/mlx5: Destroy vport QoS element when no configuration remains
  net/mlx5: Fix QoS reference leak in vport enable error path
  net/mlx5: Restore missing scheduling node cleanup on vport enable failure

Daniel Jurgens (1):
  net/mlx5: Base ECVF devlink port attrs from 0

 .../ethernet/mellanox/mlx5/core/en/dcbnl.h    |   1 -
 .../mellanox/mlx5/core/en/port_buffer.c       |  18 +-
 .../ethernet/mellanox/mlx5/core/en_dcbnl.c    |  12 +-
 .../mellanox/mlx5/core/esw/devlink_port.c     |   4 +-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 183 ++++++++++--------
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   5 -
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../net/ethernet/mellanox/mlx5/core/port.c    |  20 ++
 8 files changed, 140 insertions(+), 105 deletions(-)


base-commit: 51f27beeb79f9f92682158999bab489ff4fa16f6
-- 
2.34.1



* [PATCH V2 net 1/8] net/mlx5: Base ECVF devlink port attrs from 0
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
@ 2025-08-20 13:32 ` Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 2/8] net/mlx5: Remove default QoS group and attach vports directly to root TSAR Mark Bloch
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Daniel Jurgens, Parav Pandit,
	Mark Bloch, William Tu

From: Daniel Jurgens <danielj@nvidia.com>

Adjust the vport number by the base ECVF vport number so the port
attributes start at 0. Previously the port attributes started one
past the maximum number of host VFs.

Fixes: dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
index b7102e14d23d..c33accadae0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
@@ -47,10 +47,12 @@ static void mlx5_esw_offloads_pf_vf_devlink_port_attrs_set(struct mlx5_eswitch *
 		devlink_port_attrs_pci_vf_set(dl_port, controller_num, pfnum,
 					      vport_num - 1, external);
 	}  else if (mlx5_core_is_ec_vf_vport(esw->dev, vport_num)) {
+		u16 base_vport = mlx5_core_ec_vf_vport_base(dev);
+
 		memcpy(dl_port->attrs.switch_id.id, ppid.id, ppid.id_len);
 		dl_port->attrs.switch_id.id_len = ppid.id_len;
 		devlink_port_attrs_pci_vf_set(dl_port, 0, pfnum,
-					      vport_num - 1, false);
+					      vport_num - base_vport, false);
 	}
 }
 
-- 
2.34.1
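
The index change in this patch can be illustrated with a small standalone
sketch. It is not driver code: EXAMPLE_ECVF_BASE is a made-up constant
standing in for whatever mlx5_core_ec_vf_vport_base() returns on a real
device, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up base for illustration only; the driver obtains the real value
 * from mlx5_core_ec_vf_vport_base().
 */
#define EXAMPLE_ECVF_BASE 0x100u

/* Old behaviour: the VF index was derived as vport_num - 1, so the first
 * ECVF did not get index 0.
 */
static uint16_t ecvf_index_old(uint16_t vport_num)
{
	return (uint16_t)(vport_num - 1);
}

/* New behaviour: subtract the ECVF base so numbering starts at 0. */
static uint16_t ecvf_index_new(uint16_t vport_num, uint16_t base_vport)
{
	assert(vport_num >= base_vport);
	return (uint16_t)(vport_num - base_vport);
}
```

With the assumed base, the first ECVF vport maps to index 0 under the new
calculation, while the old calculation yields a value just below the base.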



* [PATCH V2 net 2/8] net/mlx5: Remove default QoS group and attach vports directly to root TSAR
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 1/8] net/mlx5: Base ECVF devlink port attrs from 0 Mark Bloch
@ 2025-08-20 13:32 ` Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 3/8] net/mlx5e: Preserve tc-bw during parent changes Mark Bloch
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Carolina Jubran, Cosmin Ratiu,
	Mark Bloch, Vlad Buslov, Huy Nguyen, Dmytro Linkin, Parav Pandit

From: Carolina Jubran <cjubran@nvidia.com>

Currently, the driver creates a default group (`node0`) and attaches
all vports to it unless the user explicitly sets a parent group. When
a user configures tx_share on a group and tx_share on a VF, the
expectation is that the group and the VF share bandwidth in proportion
to those values. However, since the VF is attached to the default node
rather than to the same parent as the group, the proportional share
logic is not applied correctly.

To fix this, remove the default group (`node0`) and instead connect
vports directly to the root TSAR when no parent is specified. This
ensures that vports and groups share the same root scheduler and their
tx_share values are compared directly under the same hierarchy.

Fixes: 0fe132eac38c ("net/mlx5: E-switch, Allow to add vports to rate groups")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 97 +++++++------------
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  5 -
 2 files changed, 33 insertions(+), 69 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 91d863c8c152..cd58d3934596 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -462,6 +462,7 @@ static int
 esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_node,
 				   struct netlink_ext_ack *extack)
 {
+	struct mlx5_esw_sched_node *parent = vport_node->parent;
 	u32 sched_ctx[MLX5_ST_SZ_DW(scheduling_context)] = {};
 	struct mlx5_core_dev *dev = vport_node->esw->dev;
 	void *attr;
@@ -477,7 +478,7 @@ esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_node,
 	attr = MLX5_ADDR_OF(scheduling_context, sched_ctx, element_attributes);
 	MLX5_SET(vport_element, attr, vport_number, vport_node->vport->vport);
 	MLX5_SET(scheduling_context, sched_ctx, parent_element_id,
-		 vport_node->parent->ix);
+		 parent ? parent->ix : vport_node->esw->qos.root_tsar_ix);
 	MLX5_SET(scheduling_context, sched_ctx, max_average_bw,
 		 vport_node->max_rate);
 
@@ -786,48 +787,15 @@ static int esw_qos_create(struct mlx5_eswitch *esw, struct netlink_ext_ack *exta
 		return err;
 	}
 
-	if (MLX5_CAP_QOS(dev, log_esw_max_sched_depth)) {
-		esw->qos.node0 = __esw_qos_create_vports_sched_node(esw, NULL, extack);
-	} else {
-		/* The eswitch doesn't support scheduling nodes.
-		 * Create a software-only node0 using the root TSAR to attach vport QoS to.
-		 */
-		if (!__esw_qos_alloc_node(esw,
-					  esw->qos.root_tsar_ix,
-					  SCHED_NODE_TYPE_VPORTS_TSAR,
-					  NULL))
-			esw->qos.node0 = ERR_PTR(-ENOMEM);
-		else
-			list_add_tail(&esw->qos.node0->entry,
-				      &esw->qos.domain->nodes);
-	}
-	if (IS_ERR(esw->qos.node0)) {
-		err = PTR_ERR(esw->qos.node0);
-		esw_warn(dev, "E-Switch create rate node 0 failed (%d)\n", err);
-		goto err_node0;
-	}
 	refcount_set(&esw->qos.refcnt, 1);
 
 	return 0;
-
-err_node0:
-	if (mlx5_destroy_scheduling_element_cmd(esw->dev, SCHEDULING_HIERARCHY_E_SWITCH,
-						esw->qos.root_tsar_ix))
-		esw_warn(esw->dev, "E-Switch destroy root TSAR failed.\n");
-
-	return err;
 }
 
 static void esw_qos_destroy(struct mlx5_eswitch *esw)
 {
 	int err;
 
-	if (esw->qos.node0->ix != esw->qos.root_tsar_ix)
-		__esw_qos_destroy_node(esw->qos.node0, NULL);
-	else
-		__esw_qos_free_node(esw->qos.node0);
-	esw->qos.node0 = NULL;
-
 	err = mlx5_destroy_scheduling_element_cmd(esw->dev,
 						  SCHEDULING_HIERARCHY_E_SWITCH,
 						  esw->qos.root_tsar_ix);
@@ -990,13 +958,16 @@ esw_qos_vport_tc_enable(struct mlx5_vport *vport, enum sched_node_type type,
 			struct netlink_ext_ack *extack)
 {
 	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
-	int err, new_level, max_level;
+	struct mlx5_esw_sched_node *parent = vport_node->parent;
+	int err;
 
 	if (type == SCHED_NODE_TYPE_TC_ARBITER_TSAR) {
+		int new_level, max_level;
+
 		/* Increase the parent's level by 2 to account for both the
 		 * TC arbiter and the vports TC scheduling element.
 		 */
-		new_level = vport_node->parent->level + 2;
+		new_level = (parent ? parent->level : 2) + 2;
 		max_level = 1 << MLX5_CAP_QOS(vport_node->esw->dev,
 					      log_esw_max_sched_depth);
 		if (new_level > max_level) {
@@ -1033,9 +1004,7 @@ esw_qos_vport_tc_enable(struct mlx5_vport *vport, enum sched_node_type type,
 err_sched_nodes:
 	if (type == SCHED_NODE_TYPE_RATE_LIMITER) {
 		esw_qos_node_destroy_sched_element(vport_node, NULL);
-		list_add_tail(&vport_node->entry,
-			      &vport_node->parent->children);
-		vport_node->level = vport_node->parent->level + 1;
+		esw_qos_node_attach_to_parent(vport_node);
 	} else {
 		esw_qos_tc_arbiter_scheduling_teardown(vport_node, NULL);
 	}
@@ -1083,7 +1052,6 @@ static int esw_qos_set_vport_tcs_min_rate(struct mlx5_vport *vport,
 static void esw_qos_vport_disable(struct mlx5_vport *vport, struct netlink_ext_ack *extack)
 {
 	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
-	struct mlx5_esw_sched_node *parent = vport_node->parent;
 	enum sched_node_type curr_type = vport_node->type;
 
 	if (curr_type == SCHED_NODE_TYPE_VPORT)
@@ -1093,7 +1061,7 @@ static void esw_qos_vport_disable(struct mlx5_vport *vport, struct netlink_ext_a
 
 	vport_node->bw_share = 0;
 	list_del_init(&vport_node->entry);
-	esw_qos_normalize_min_rate(parent->esw, parent, extack);
+	esw_qos_normalize_min_rate(vport_node->esw, vport_node->parent, extack);
 
 	trace_mlx5_esw_vport_qos_destroy(vport_node->esw->dev, vport);
 }
@@ -1103,25 +1071,23 @@ static int esw_qos_vport_enable(struct mlx5_vport *vport,
 				struct mlx5_esw_sched_node *parent,
 				struct netlink_ext_ack *extack)
 {
+	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
 	int err;
 
 	esw_assert_qos_lock_held(vport->dev->priv.eswitch);
 
-	esw_qos_node_set_parent(vport->qos.sched_node, parent);
-	if (type == SCHED_NODE_TYPE_VPORT) {
-		err = esw_qos_vport_create_sched_element(vport->qos.sched_node,
-							 extack);
-	} else {
+	esw_qos_node_set_parent(vport_node, parent);
+	if (type == SCHED_NODE_TYPE_VPORT)
+		err = esw_qos_vport_create_sched_element(vport_node, extack);
+	else
 		err = esw_qos_vport_tc_enable(vport, type, extack);
-	}
 	if (err)
 		return err;
 
-	vport->qos.sched_node->type = type;
-	esw_qos_normalize_min_rate(parent->esw, parent, extack);
-	trace_mlx5_esw_vport_qos_create(vport->dev, vport,
-					vport->qos.sched_node->max_rate,
-					vport->qos.sched_node->bw_share);
+	vport_node->type = type;
+	esw_qos_normalize_min_rate(vport_node->esw, parent, extack);
+	trace_mlx5_esw_vport_qos_create(vport->dev, vport, vport_node->max_rate,
+					vport_node->bw_share);
 
 	return 0;
 }
@@ -1132,6 +1098,7 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
 {
 	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
 	struct mlx5_esw_sched_node *sched_node;
+	struct mlx5_eswitch *parent_esw;
 	int err;
 
 	esw_assert_qos_lock_held(esw);
@@ -1139,10 +1106,12 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
 	if (err)
 		return err;
 
-	parent = parent ?: esw->qos.node0;
-	sched_node = __esw_qos_alloc_node(parent->esw, 0, type, parent);
+	parent_esw = parent ? parent->esw : esw;
+	sched_node = __esw_qos_alloc_node(parent_esw, 0, type, parent);
 	if (!sched_node)
 		return -ENOMEM;
+	if (!parent)
+		list_add_tail(&sched_node->entry, &esw->qos.domain->nodes);
 
 	sched_node->max_rate = max_rate;
 	sched_node->min_rate = min_rate;
@@ -1168,7 +1137,7 @@ void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
 		goto unlock;
 
 	parent = vport->qos.sched_node->parent;
-	WARN(parent != esw->qos.node0, "Disabling QoS on port before detaching it from node");
+	WARN(parent, "Disabling QoS on port before detaching it from node");
 
 	esw_qos_vport_disable(vport, NULL);
 	mlx5_esw_qos_vport_qos_free(vport);
@@ -1268,7 +1237,6 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
 	int err;
 
 	esw_assert_qos_lock_held(vport->dev->priv.eswitch);
-	parent = parent ?: curr_parent;
 	if (curr_type == type && curr_parent == parent)
 		return 0;
 
@@ -1306,16 +1274,16 @@ static int esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw
 
 	esw_assert_qos_lock_held(esw);
 	curr_parent = vport->qos.sched_node->parent;
-	parent = parent ?: esw->qos.node0;
 	if (curr_parent == parent)
 		return 0;
 
 	/* Set vport QoS type based on parent node type if different from
 	 * default QoS; otherwise, use the vport's current QoS type.
 	 */
-	if (parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
+	if (parent && parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
 		type = SCHED_NODE_TYPE_RATE_LIMITER;
-	else if (curr_parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
+	else if (curr_parent &&
+		 curr_parent->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
 		type = SCHED_NODE_TYPE_VPORT;
 	else
 		type = vport->qos.sched_node->type;
@@ -1654,9 +1622,10 @@ static bool esw_qos_validate_unsupported_tc_bw(struct mlx5_eswitch *esw,
 static bool esw_qos_vport_validate_unsupported_tc_bw(struct mlx5_vport *vport,
 						     u32 *tc_bw)
 {
-	struct mlx5_eswitch *esw = vport->qos.sched_node ?
-				   vport->qos.sched_node->parent->esw :
-				   vport->dev->priv.eswitch;
+	struct mlx5_esw_sched_node *node = vport->qos.sched_node;
+	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
+
+	esw = (node && node->parent) ? node->parent->esw : esw;
 
 	return esw_qos_validate_unsupported_tc_bw(esw, tc_bw);
 }
@@ -1763,7 +1732,7 @@ int mlx5_esw_devlink_rate_leaf_tc_bw_set(struct devlink_rate *rate_leaf,
 	if (disable) {
 		if (vport_node->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
 			err = esw_qos_vport_update(vport, SCHED_NODE_TYPE_VPORT,
-						   NULL, extack);
+						   vport_node->parent, extack);
 		goto unlock;
 	}
 
@@ -1775,7 +1744,7 @@ int mlx5_esw_devlink_rate_leaf_tc_bw_set(struct devlink_rate *rate_leaf,
 	} else {
 		err = esw_qos_vport_update(vport,
 					   SCHED_NODE_TYPE_TC_ARBITER_TSAR,
-					   NULL, extack);
+					   vport_node->parent, extack);
 	}
 	if (!err)
 		esw_qos_set_tc_arbiter_bw_shares(vport_node, tc_bw, extack);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index b0b8ef3ec3c4..45506ad56847 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -373,11 +373,6 @@ struct mlx5_eswitch {
 		refcount_t refcnt;
 		u32 root_tsar_ix;
 		struct mlx5_qos_domain *domain;
-		/* Contains all vports with QoS enabled but no explicit node.
-		 * Cannot be NULL if QoS is enabled, but may be a fake node
-		 * referencing the root TSAR if the esw doesn't support nodes.
-		 */
-		struct mlx5_esw_sched_node *node0;
 	} qos;
 
 	struct mlx5_esw_bridge_offloads *br_offloads;
-- 
2.34.1
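
The parent fallback introduced in esw_qos_vport_create_sched_element() can
be sketched in isolation as follows. This is a simplified model, not the
driver's code: struct sched_node and parent_element_id() are hypothetical,
and only the field names ix, parent and root_tsar_ix mirror the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct mlx5_esw_sched_node. */
struct sched_node {
	uint32_t ix;               /* scheduling element index */
	struct sched_node *parent; /* NULL when no group was set */
};

/* Pick the parent element id: the explicit parent if there is one,
 * otherwise the root TSAR, so groups and parentless vports end up
 * under the same scheduler.
 */
static uint32_t parent_element_id(const struct sched_node *node,
				  uint32_t root_tsar_ix)
{
	assert(node != NULL);
	return node->parent ? node->parent->ix : root_tsar_ix;
}
```

A vport with no parent resolves to the root TSAR index, which is the same
element a top-level group hangs from, so their tx_share values are compared
at the same level.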



* [PATCH V2 net 3/8] net/mlx5e: Preserve tc-bw during parent changes
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 1/8] net/mlx5: Base ECVF devlink port attrs from 0 Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 2/8] net/mlx5: Remove default QoS group and attach vports directly to root TSAR Mark Bloch
@ 2025-08-20 13:32 ` Mark Bloch
  2025-10-02 16:31   ` ChaosEsque Team
  2025-08-20 13:32 ` [PATCH V2 net 4/8] net/mlx5: Destroy vport QoS element when no configuration remains Mark Bloch
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Carolina Jubran, Cosmin Ratiu,
	Mark Bloch

From: Carolina Jubran <cjubran@nvidia.com>

When changing the parent of a node/leaf with tc-bw configured, the code
saves and restores the tc-bw values. However, it was reading the
converted hardware bw_share values (where 0 becomes 1) instead of the
original user values, causing incorrect tc-bw calculations after the
parent change.

Store original tc-bw values in the node structure and use them directly
for save/restore operations.

Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index cd58d3934596..4ed5968f1638 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -102,6 +102,8 @@ struct mlx5_esw_sched_node {
 	u8 level;
 	/* Valid only when this node represents a traffic class. */
 	u8 tc;
+	/* Valid only for a TC arbiter node or vport TC arbiter. */
+	u32 tc_bw[DEVLINK_RATE_TCS_MAX];
 };
 
 static void esw_qos_node_attach_to_parent(struct mlx5_esw_sched_node *node)
@@ -609,10 +611,7 @@ static void
 esw_qos_tc_arbiter_get_bw_shares(struct mlx5_esw_sched_node *tc_arbiter_node,
 				 u32 *tc_bw)
 {
-	struct mlx5_esw_sched_node *vports_tc_node;
-
-	list_for_each_entry(vports_tc_node, &tc_arbiter_node->children, entry)
-		tc_bw[vports_tc_node->tc] = vports_tc_node->bw_share;
+	memcpy(tc_bw, tc_arbiter_node->tc_bw, sizeof(tc_arbiter_node->tc_bw));
 }
 
 static void
@@ -629,6 +628,7 @@ esw_qos_set_tc_arbiter_bw_shares(struct mlx5_esw_sched_node *tc_arbiter_node,
 		u8 tc = vports_tc_node->tc;
 		u32 bw_share;
 
+		tc_arbiter_node->tc_bw[tc] = tc_bw[tc];
 		bw_share = tc_bw[tc] * fw_max_bw_share;
 		bw_share = esw_qos_calc_bw_share(bw_share, divider,
 						 fw_max_bw_share);
@@ -1060,6 +1060,7 @@ static void esw_qos_vport_disable(struct mlx5_vport *vport, struct netlink_ext_a
 		esw_qos_vport_tc_disable(vport, extack);
 
 	vport_node->bw_share = 0;
+	memset(vport_node->tc_bw, 0, sizeof(vport_node->tc_bw));
 	list_del_init(&vport_node->entry);
 	esw_qos_normalize_min_rate(vport_node->esw, vport_node->parent, extack);
 
@@ -1231,8 +1232,9 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
 				struct mlx5_esw_sched_node *parent,
 				struct netlink_ext_ack *extack)
 {
-	struct mlx5_esw_sched_node *curr_parent = vport->qos.sched_node->parent;
-	enum sched_node_type curr_type = vport->qos.sched_node->type;
+	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
+	struct mlx5_esw_sched_node *curr_parent = vport_node->parent;
+	enum sched_node_type curr_type = vport_node->type;
 	u32 curr_tc_bw[DEVLINK_RATE_TCS_MAX] = {0};
 	int err;
 
@@ -1244,10 +1246,8 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
 	if (err)
 		return err;
 
-	if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type) {
-		esw_qos_tc_arbiter_get_bw_shares(vport->qos.sched_node,
-						 curr_tc_bw);
-	}
+	if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type)
+		esw_qos_tc_arbiter_get_bw_shares(vport_node, curr_tc_bw);
 
 	esw_qos_vport_disable(vport, extack);
 
@@ -1258,8 +1258,8 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
 	}
 
 	if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type) {
-		esw_qos_set_tc_arbiter_bw_shares(vport->qos.sched_node,
-						 curr_tc_bw, extack);
+		esw_qos_set_tc_arbiter_bw_shares(vport_node, curr_tc_bw,
+						 extack);
 	}
 
 	return err;
-- 
2.34.1
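
Why reading back the converted values is lossy can be shown with a minimal
model. Everything below is an assumption made for illustration: TCS_MAX is
an arbitrary array size, to_hw_share() models only the "0 becomes 1"
property mentioned in the commit message (the driver's real conversion in
esw_qos_calc_bw_share() does more), and struct arbiter is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TCS_MAX 8 /* illustrative size, not taken from the patch */

/* Hypothetical conversion: a user value of 0 is programmed as 1. */
static uint32_t to_hw_share(uint32_t user_bw)
{
	return user_bw ? user_bw : 1;
}

struct arbiter {
	uint32_t hw_share[TCS_MAX]; /* what was programmed */
	uint32_t tc_bw[TCS_MAX];    /* what the user asked for */
};

/* Program the shares and remember the original request alongside them. */
static void arbiter_set(struct arbiter *a, const uint32_t *tc_bw)
{
	assert(a != NULL && tc_bw != NULL);
	for (int tc = 0; tc < TCS_MAX; tc++) {
		a->tc_bw[tc] = tc_bw[tc];
		a->hw_share[tc] = to_hw_share(tc_bw[tc]);
	}
}

/* Save path: copy the stored user values, not the converted ones. */
static void arbiter_get(const struct arbiter *a, uint32_t *tc_bw)
{
	memcpy(tc_bw, a->tc_bw, sizeof(a->tc_bw));
}
```

Restoring from hw_share would turn a requested 0 into 1 after every parent
change; restoring from tc_bw round-trips the user's values unchanged.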



* [PATCH V2 net 4/8] net/mlx5: Destroy vport QoS element when no configuration remains
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
                   ` (2 preceding siblings ...)
  2025-08-20 13:32 ` [PATCH V2 net 3/8] net/mlx5e: Preserve tc-bw during parent changes Mark Bloch
@ 2025-08-20 13:32 ` Mark Bloch
  2025-08-20 14:45   ` Przemek Kitszel
  2025-08-20 13:32 ` [PATCH V2 net 5/8] net/mlx5: Fix QoS reference leak in vport enable error path Mark Bloch
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Carolina Jubran, Cosmin Ratiu,
	Mark Bloch, Mohamad Haj Yahia, Saeed Mahameed

From: Carolina Jubran <cjubran@nvidia.com>

If a VF has been configured and the user later clears all QoS settings,
the vport element remains in the firmware QoS tree. This leads to
inconsistent behavior compared to VFs that were never configured, since
the FW assumes that unconfigured VFs are outside the QoS hierarchy.
As a result, the bandwidth share across VFs may differ, even though
none of them appear to have any configuration.

Align the driver behavior with the FW expectation by destroying the
vport QoS element when all configurations are removed.

Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate")
Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 57 ++++++++++++++++---
 1 file changed, 49 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 4ed5968f1638..452a948a3e6d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -1127,6 +1127,19 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
 	return err;
 }
 
+static void mlx5_esw_qos_vport_disable_locked(struct mlx5_vport *vport)
+{
+	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
+
+	esw_assert_qos_lock_held(esw);
+	if (!vport->qos.sched_node)
+		return;
+
+	esw_qos_vport_disable(vport, NULL);
+	mlx5_esw_qos_vport_qos_free(vport);
+	esw_qos_put(esw);
+}
+
 void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
 {
 	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
@@ -1140,9 +1153,7 @@ void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
 	parent = vport->qos.sched_node->parent;
 	WARN(parent, "Disabling QoS on port before detaching it from node");
 
-	esw_qos_vport_disable(vport, NULL);
-	mlx5_esw_qos_vport_qos_free(vport);
-	esw_qos_put(esw);
+	mlx5_esw_qos_vport_disable_locked(vport);
 unlock:
 	esw_qos_unlock(esw);
 }
@@ -1642,6 +1653,21 @@ static bool esw_qos_tc_bw_disabled(u32 *tc_bw)
 	return true;
 }
 
+static void esw_vport_qos_prune_empty(struct mlx5_vport *vport)
+{
+	struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
+
+	esw_assert_qos_lock_held(vport->dev->priv.eswitch);
+	if (!vport_node)
+		return;
+
+	if (vport_node->parent || vport_node->max_rate ||
+	    vport_node->min_rate || !esw_qos_tc_bw_disabled(vport_node->tc_bw))
+		return;
+
+	mlx5_esw_qos_vport_disable_locked(vport);
+}
+
 int mlx5_esw_qos_init(struct mlx5_eswitch *esw)
 {
 	if (esw->qos.domain)
@@ -1675,6 +1701,10 @@ int mlx5_esw_devlink_rate_leaf_tx_share_set(struct devlink_rate *rate_leaf, void
 
 	esw_qos_lock(esw);
 	err = mlx5_esw_qos_set_vport_min_rate(vport, tx_share, extack);
+	if (err)
+		goto out;
+	esw_vport_qos_prune_empty(vport);
+out:
 	esw_qos_unlock(esw);
 	return err;
 }
@@ -1696,6 +1726,10 @@ int mlx5_esw_devlink_rate_leaf_tx_max_set(struct devlink_rate *rate_leaf, void *
 
 	esw_qos_lock(esw);
 	err = mlx5_esw_qos_set_vport_max_rate(vport, tx_max, extack);
+	if (err)
+		goto out;
+	esw_vport_qos_prune_empty(vport);
+out:
 	esw_qos_unlock(esw);
 	return err;
 }
@@ -1733,6 +1767,7 @@ int mlx5_esw_devlink_rate_leaf_tc_bw_set(struct devlink_rate *rate_leaf,
 		if (vport_node->type == SCHED_NODE_TYPE_TC_ARBITER_TSAR)
 			err = esw_qos_vport_update(vport, SCHED_NODE_TYPE_VPORT,
 						   vport_node->parent, extack);
+		esw_vport_qos_prune_empty(vport);
 		goto unlock;
 	}
 
@@ -1893,14 +1928,20 @@ int mlx5_esw_devlink_rate_leaf_parent_set(struct devlink_rate *devlink_rate,
 					  void *priv, void *parent_priv,
 					  struct netlink_ext_ack *extack)
 {
-	struct mlx5_esw_sched_node *node;
+	struct mlx5_esw_sched_node *node = parent ? parent_priv : NULL;
 	struct mlx5_vport *vport = priv;
+	int err;
 
-	if (!parent)
-		return mlx5_esw_qos_vport_update_parent(vport, NULL, extack);
+	err = mlx5_esw_qos_vport_update_parent(vport, node, extack);
+	if (!err) {
+		struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
+
+		esw_qos_lock(esw);
+		esw_vport_qos_prune_empty(vport);
+		esw_qos_unlock(esw);
+	}
 
-	node = parent_priv;
-	return mlx5_esw_qos_vport_update_parent(vport, node, extack);
+	return err;
 }
 
 static bool esw_qos_is_node_empty(struct mlx5_esw_sched_node *node)
-- 
2.34.1



* [PATCH V2 net 5/8] net/mlx5: Fix QoS reference leak in vport enable error path
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
                   ` (3 preceding siblings ...)
  2025-08-20 13:32 ` [PATCH V2 net 4/8] net/mlx5: Destroy vport QoS element when no configuration remains Mark Bloch
@ 2025-08-20 13:32 ` Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 6/8] net/mlx5: Restore missing scheduling node cleanup on vport enable failure Mark Bloch
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Carolina Jubran, Cosmin Ratiu,
	Mark Bloch

From: Carolina Jubran <cjubran@nvidia.com>

Add missing esw_qos_put() call when __esw_qos_alloc_node() fails in
mlx5_esw_qos_vport_enable().

Fixes: be034baba83e ("net/mlx5: Make vport QoS enablement more flexible for future extensions")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 452a948a3e6d..41aec07bb6c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -1109,8 +1109,10 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
 
 	parent_esw = parent ? parent->esw : esw;
 	sched_node = __esw_qos_alloc_node(parent_esw, 0, type, parent);
-	if (!sched_node)
+	if (!sched_node) {
+		esw_qos_put(esw);
 		return -ENOMEM;
+	}
 	if (!parent)
 		list_add_tail(&sched_node->entry, &esw->qos.domain->nodes);
 
-- 
2.34.1



* [PATCH V2 net 6/8] net/mlx5: Restore missing scheduling node cleanup on vport enable failure
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
                   ` (4 preceding siblings ...)
  2025-08-20 13:32 ` [PATCH V2 net 5/8] net/mlx5: Fix QoS reference leak in vport enable error path Mark Bloch
@ 2025-08-20 13:32 ` Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 7/8] net/mlx5e: Query FW for buffer ownership Mark Bloch
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Carolina Jubran, Cosmin Ratiu,
	Mark Bloch

From: Carolina Jubran <cjubran@nvidia.com>

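Restore the __esw_qos_free_node() call in the error path of
mlx5_esw_qos_vport_enable(), which was removed by the offending commit.
Without it, the allocated scheduling node is leaked when
esw_qos_vport_enable() fails.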

Fixes: 97733d1e00a0 ("net/mlx5: Add traffic class scheduling support for vport QoS")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 41aec07bb6c2..8b4977650183 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -1122,6 +1122,7 @@ static int mlx5_esw_qos_vport_enable(struct mlx5_vport *vport, enum sched_node_t
 	vport->qos.sched_node = sched_node;
 	err = esw_qos_vport_enable(vport, type, parent, extack);
 	if (err) {
+		__esw_qos_free_node(sched_node);
 		esw_qos_put(esw);
 		vport->qos.sched_node = NULL;
 	}
-- 
2.34.1
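
Patches 5 and 6 together make the enable path release everything it took
on each failure. The pattern can be modelled with two counters, as in the
sketch below. It is an assumption-laden toy: struct qos_state, struct node,
the failure flags and the -EIO code are all hypothetical, and the counters
merely stand in for the QoS reference count and the allocated node.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct node { int type; };

/* Counters standing in for the QoS refcount and live scheduling nodes. */
struct qos_state {
	int refcnt;
	int live_nodes;
};

static struct node *node_alloc(struct qos_state *s, bool fail)
{
	struct node *n = fail ? NULL : calloc(1, sizeof(*n));

	if (n)
		s->live_nodes++;
	return n;
}

static void node_free(struct qos_state *s, struct node *n)
{
	free(n);
	s->live_nodes--;
}

/* Enable path with both fixes applied: each failure undoes what was
 * acquired before it.
 */
static int vport_enable(struct qos_state *s, struct node **out,
			bool fail_alloc, bool fail_hw)
{
	struct node *n;

	assert(s != NULL && out != NULL);
	s->refcnt++;                  /* take the QoS reference */
	n = node_alloc(s, fail_alloc);
	if (!n) {
		s->refcnt--;          /* patch 5: drop the reference */
		return -ENOMEM;
	}
	if (fail_hw) {
		node_free(s, n);      /* patch 6: free the node */
		s->refcnt--;
		return -EIO;
	}
	*out = n;
	return 0;
}
```

After either failure both counters return to their starting values; only a
successful call leaves one reference and one node behind.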



* [PATCH V2 net 7/8] net/mlx5e: Query FW for buffer ownership
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
                   ` (5 preceding siblings ...)
  2025-08-20 13:32 ` [PATCH V2 net 6/8] net/mlx5: Restore missing scheduling node cleanup on vport enable failure Mark Bloch
@ 2025-08-20 13:32 ` Mark Bloch
  2025-08-20 13:32 ` [PATCH V2 net 8/8] net/mlx5e: Preserve shared buffer capacity during headroom updates Mark Bloch
  2025-08-21 15:20 ` [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 patchwork-bot+netdevbpf
  8 siblings, 0 replies; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Alexei Lazar, Shahar Shitrit,
	Dragos Tatulea, Mark Bloch, Huy Nguyen, Parav Pandit,
	Saeed Mahameed

From: Alexei Lazar <alazar@nvidia.com>

The SW currently saves local buffer ownership when setting
the buffer.
This means that the SW assumes it has ownership of the buffer
after the command is set.

If setting the buffer fails and we remain in FW ownership,
the local buffer ownership state incorrectly remains as SW-owned.
This leads to incorrect behavior in subsequent PFC commands,
causing failures.

Instead of saving local buffer ownership in SW,
query the FW for buffer ownership when setting the buffer.
This ensures that the buffer ownership state is accurately
reflected, avoiding the issues caused by incorrect ownership
states.
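
The failure mode described above can be sketched outside the driver as
follows. This is only an illustration under stated assumptions: every
name in it (fake_fw, fake_drv, the OWNER_* values and the helper
functions) is hypothetical and is not part of mlx5. It models a cached
flag that is set before the command is issued, next to a check that
asks the (simulated) firmware each time.

```c
#include <stdbool.h>

/* Hypothetical model, not mlx5 code: who owns the port buffer. */
enum buf_owner { OWNER_UNKNOWN, OWNER_SW, OWNER_FW };

struct fake_fw {
	enum buf_owner owner;
	bool fail_next_set;	/* make the next "set buffer" command fail */
};

struct fake_drv {
	struct fake_fw *fw;
	bool manual_buffer;	/* locally cached ownership flag */
};

/* Simulated "set buffer" command: ownership moves to SW only on success. */
static inline int fake_fw_set_buffer(struct fake_fw *fw)
{
	if (fw->fail_next_set)
		return -1;
	fw->owner = OWNER_SW;
	return 0;
}

/* Cached pattern: the flag is set before the command result is known. */
static inline int drv_setbuffer_cached(struct fake_drv *drv)
{
	drv->manual_buffer = true;
	return fake_fw_set_buffer(drv->fw);
}

/* Ownership as seen through the cached flag. */
static inline bool drv_is_sw_owned_cached(const struct fake_drv *drv)
{
	return drv->manual_buffer;
}

/* Ownership as reported by the (simulated) firmware. */
static inline bool drv_is_sw_owned_queried(const struct fake_drv *drv)
{
	return drv->fw->owner == OWNER_SW;
}
```

With fail_next_set = true and the firmware initially in OWNER_FW, the
cached check reports SW ownership after the failed set, while the
queried check still reports FW ownership.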

Fixes: ecdf2dadee8e ("net/mlx5e: Receive buffer support for DCBX")
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/dcbnl.h    |  1 -
 .../ethernet/mellanox/mlx5/core/en_dcbnl.c    | 12 ++++++++---
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  2 ++
 .../net/ethernet/mellanox/mlx5/core/port.c    | 20 +++++++++++++++++++
 4 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/dcbnl.h b/drivers/net/ethernet/mellanox/mlx5/core/en/dcbnl.h
index b59aee75de94..2c98a5299df3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/dcbnl.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/dcbnl.h
@@ -26,7 +26,6 @@ struct mlx5e_dcbx {
 	u8                         cap;
 
 	/* Buffer configuration */
-	bool                       manual_buffer;
 	u32                        cable_len;
 	u32                        xoff;
 	u16                        port_buff_cell_sz;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
index 5fe016e477b3..d166c0d5189e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
@@ -362,6 +362,7 @@ static int mlx5e_dcbnl_ieee_getpfc(struct net_device *dev,
 static int mlx5e_dcbnl_ieee_setpfc(struct net_device *dev,
 				   struct ieee_pfc *pfc)
 {
+	u8 buffer_ownership = MLX5_BUF_OWNERSHIP_UNKNOWN;
 	struct mlx5e_priv *priv = netdev_priv(dev);
 	struct mlx5_core_dev *mdev = priv->mdev;
 	u32 old_cable_len = priv->dcbx.cable_len;
@@ -389,7 +390,14 @@ static int mlx5e_dcbnl_ieee_setpfc(struct net_device *dev,
 
 	if (MLX5_BUFFER_SUPPORTED(mdev)) {
 		pfc_new.pfc_en = (changed & MLX5E_PORT_BUFFER_PFC) ? pfc->pfc_en : curr_pfc_en;
-		if (priv->dcbx.manual_buffer)
+		ret = mlx5_query_port_buffer_ownership(mdev,
+						       &buffer_ownership);
+		if (ret)
+			netdev_err(dev,
+				   "%s, Failed to get buffer ownership: %d\n",
+				   __func__, ret);
+
+		if (buffer_ownership == MLX5_BUF_OWNERSHIP_SW_OWNED)
 			ret = mlx5e_port_manual_buffer_config(priv, changed,
 							      dev->mtu, &pfc_new,
 							      NULL, NULL);
@@ -982,7 +990,6 @@ static int mlx5e_dcbnl_setbuffer(struct net_device *dev,
 	if (!changed)
 		return 0;
 
-	priv->dcbx.manual_buffer = true;
 	err = mlx5e_port_manual_buffer_config(priv, changed, dev->mtu, NULL,
 					      buffer_size, prio2buffer);
 	return err;
@@ -1252,7 +1259,6 @@ void mlx5e_dcbnl_initialize(struct mlx5e_priv *priv)
 		priv->dcbx.cap |= DCB_CAP_DCBX_HOST;
 
 	priv->dcbx.port_buff_cell_sz = mlx5e_query_port_buffers_cell_size(priv);
-	priv->dcbx.manual_buffer = false;
 	priv->dcbx.cable_len = MLX5E_DEFAULT_CABLE_LEN;
 
 	mlx5e_ets_init(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index b6d53db27cd5..9d3504f5abfa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -367,6 +367,8 @@ int mlx5_query_port_dcbx_param(struct mlx5_core_dev *mdev, u32 *out);
 int mlx5_set_port_dcbx_param(struct mlx5_core_dev *mdev, u32 *in);
 int mlx5_set_trust_state(struct mlx5_core_dev *mdev, u8 trust_state);
 int mlx5_query_trust_state(struct mlx5_core_dev *mdev, u8 *trust_state);
+int mlx5_query_port_buffer_ownership(struct mlx5_core_dev *mdev,
+				     u8 *buffer_ownership);
 int mlx5_set_dscp2prio(struct mlx5_core_dev *mdev, u8 dscp, u8 prio);
 int mlx5_query_dscp2prio(struct mlx5_core_dev *mdev, u8 *dscp2prio);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/port.c b/drivers/net/ethernet/mellanox/mlx5/core/port.c
index 549f1066d2a5..2d7adf7444ba 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/port.c
@@ -968,6 +968,26 @@ int mlx5_query_trust_state(struct mlx5_core_dev *mdev, u8 *trust_state)
 	return err;
 }
 
+int mlx5_query_port_buffer_ownership(struct mlx5_core_dev *mdev,
+				     u8 *buffer_ownership)
+{
+	u32 out[MLX5_ST_SZ_DW(pfcc_reg)] = {};
+	int err;
+
+	if (!MLX5_CAP_PCAM_FEATURE(mdev, buffer_ownership)) {
+		*buffer_ownership = MLX5_BUF_OWNERSHIP_UNKNOWN;
+		return 0;
+	}
+
+	err = mlx5_query_pfcc_reg(mdev, out, sizeof(out));
+	if (err)
+		return err;
+
+	*buffer_ownership = MLX5_GET(pfcc_reg, out, buf_ownership);
+
+	return 0;
+}
+
 int mlx5_set_dscp2prio(struct mlx5_core_dev *mdev, u8 dscp, u8 prio)
 {
 	int sz = MLX5_ST_SZ_BYTES(qpdpm_reg);
-- 
2.34.1


* [PATCH V2 net 8/8] net/mlx5e: Preserve shared buffer capacity during headroom updates
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
                   ` (6 preceding siblings ...)
  2025-08-20 13:32 ` [PATCH V2 net 7/8] net/mlx5e: Query FW for buffer ownership Mark Bloch
@ 2025-08-20 13:32 ` Mark Bloch
  2025-08-20 14:50   ` Przemek Kitszel
  2025-08-21 15:20 ` [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 patchwork-bot+netdevbpf
  8 siblings, 1 reply; 13+ messages in thread
From: Mark Bloch @ 2025-08-20 13:32 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel
  Cc: Tariq Toukan, Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma,
	linux-kernel, Gal Pressman, Armen Ratner, Maher Sanalla,
	Alexei Lazar, Mark Bloch, Moshe Shemesh

From: Armen Ratner <armeng@nvidia.com>

When port buffer headroom changes, port_update_shared_buffer()
recalculates the shared buffer size and splits it in a 3:1 ratio
(lossy:lossless). Currently, the calculation is:
lossless = shared / 4;
lossy = (shared / 4) * 3;

Because of integer division, this calculation drops the remainder
(shared % 4), unintentionally reducing the total shared buffer
by up to three cells on each update. Over time, this could shrink
the buffer below a usable size.

Fix it by changing the calculation to:
lossless = shared / 4;
lossy = shared - lossless;

This retains all buffer cells while still approximating the
intended 3:1 split, preventing capacity loss over time.
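
As a worked illustration of the arithmetic, the two calculations can be
compared in a standalone sketch. The struct and function names here
(pool_split, split_old, split_new) are hypothetical; the driver
performs this computation inline in port_update_shared_buffer().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two pool sizes, in cells. */
struct pool_split {
	uint32_t lossy;
	uint32_t lossless;
};

/* Old calculation: both pools derive from shared / 4, so the
 * remainder (shared % 4) is not assigned to either pool.
 */
static inline struct pool_split split_old(uint32_t shared)
{
	struct pool_split s;

	s.lossless = shared / 4;
	s.lossy = (shared / 4) * 3;
	return s;
}

/* New calculation: lossy takes whatever lossless did not,
 * so the two pools always add up to the full shared size.
 */
static inline struct pool_split split_new(uint32_t shared)
{
	struct pool_split s;

	s.lossless = shared / 4;
	s.lossy = shared - s.lossless;
	assert(s.lossy + s.lossless == shared);
	return s;
}
```

For shared = 1023 cells, split_old() gives 765 + 255 = 1020 (three
cells lost), while split_new() gives 768 + 255 = 1023.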

While at it, perform headroom calculations in units of cells rather than
in bytes, for more accurate calculations and to avoid extra divisions.
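
One way the two units can disagree is sketched below, assuming
per-buffer sizes are rounded down to whole cells when programmed. The
function names are hypothetical and the code is not taken from the
driver; it only shows that summing bytes and dividing once can yield a
different cell count than dividing each buffer and summing the cells.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the byte sizes first, then convert the total to cells once. */
static inline uint32_t headroom_cells_sum_then_divide(const uint32_t *sizes,
						      size_t n,
						      uint32_t cell_sz)
{
	uint32_t total_bytes = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total_bytes += sizes[i];
	return total_bytes / cell_sz;
}

/* Convert each buffer to whole cells, then sum the cell counts. */
static inline uint32_t headroom_cells_divide_then_sum(const uint32_t *sizes,
						      size_t n,
						      uint32_t cell_sz)
{
	uint32_t total_cells = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total_cells += sizes[i] / cell_sz;
	return total_cells;
}
```

With two buffers of 200 bytes each and a 128-byte cell, the first
function returns 3 cells (400 / 128) and the second returns 2 cells
(1 + 1), which is the sum of the per-buffer cell counts.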

Fixes: a440030d8946 ("net/mlx5e: Update shared buffer along with device buffer changes")
Signed-off-by: Armen Ratner <armeng@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 .../mellanox/mlx5/core/en/port_buffer.c        | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/port_buffer.c b/drivers/net/ethernet/mellanox/mlx5/core/en/port_buffer.c
index 5ae787656a7c..3efa8bf1d14e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/port_buffer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/port_buffer.c
@@ -272,8 +272,8 @@ static int port_update_shared_buffer(struct mlx5_core_dev *mdev,
 	/* Total shared buffer size is split in a ratio of 3:1 between
 	 * lossy and lossless pools respectively.
 	 */
-	lossy_epool_size = (shared_buffer_size / 4) * 3;
 	lossless_ipool_size = shared_buffer_size / 4;
+	lossy_epool_size    = shared_buffer_size - lossless_ipool_size;
 
 	mlx5e_port_set_sbpr(mdev, 0, MLX5_EGRESS_DIR, MLX5_LOSSY_POOL, 0,
 			    lossy_epool_size);
@@ -288,14 +288,12 @@ static int port_set_buffer(struct mlx5e_priv *priv,
 	u16 port_buff_cell_sz = priv->dcbx.port_buff_cell_sz;
 	struct mlx5_core_dev *mdev = priv->mdev;
 	int sz = MLX5_ST_SZ_BYTES(pbmc_reg);
-	u32 new_headroom_size = 0;
-	u32 current_headroom_size;
+	u32 current_headroom_cells = 0;
+	u32 new_headroom_cells = 0;
 	void *in;
 	int err;
 	int i;
 
-	current_headroom_size = port_buffer->headroom_size;
-
 	in = kzalloc(sz, GFP_KERNEL);
 	if (!in)
 		return -ENOMEM;
@@ -306,12 +304,14 @@ static int port_set_buffer(struct mlx5e_priv *priv,
 
 	for (i = 0; i < MLX5E_MAX_NETWORK_BUFFER; i++) {
 		void *buffer = MLX5_ADDR_OF(pbmc_reg, in, buffer[i]);
+		current_headroom_cells += MLX5_GET(bufferx_reg, buffer, size);
+
 		u64 size = port_buffer->buffer[i].size;
 		u64 xoff = port_buffer->buffer[i].xoff;
 		u64 xon = port_buffer->buffer[i].xon;
 
-		new_headroom_size += size;
 		do_div(size, port_buff_cell_sz);
+		new_headroom_cells += size;
 		do_div(xoff, port_buff_cell_sz);
 		do_div(xon, port_buff_cell_sz);
 		MLX5_SET(bufferx_reg, buffer, size, size);
@@ -320,10 +320,8 @@ static int port_set_buffer(struct mlx5e_priv *priv,
 		MLX5_SET(bufferx_reg, buffer, xon_threshold, xon);
 	}
 
-	new_headroom_size /= port_buff_cell_sz;
-	current_headroom_size /= port_buff_cell_sz;
-	err = port_update_shared_buffer(priv->mdev, current_headroom_size,
-					new_headroom_size);
+	err = port_update_shared_buffer(priv->mdev, current_headroom_cells,
+					new_headroom_cells);
 	if (err)
 		goto out;
 
-- 
2.34.1


* Re: [PATCH V2 net 4/8] net/mlx5: Destroy vport QoS element when no configuration remains
  2025-08-20 13:32 ` [PATCH V2 net 4/8] net/mlx5: Destroy vport QoS element when no configuration remains Mark Bloch
@ 2025-08-20 14:45   ` Przemek Kitszel
  0 siblings, 0 replies; 13+ messages in thread
From: Przemek Kitszel @ 2025-08-20 14:45 UTC (permalink / raw)
  To: Mark Bloch, Carolina Jubran
  Cc: Tariq Toukan, Leon Romanovsky, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Cosmin Ratiu,
	Mohamad Haj Yahia, Saeed Mahameed

On 8/20/25 15:32, Mark Bloch wrote:
> From: Carolina Jubran <cjubran@nvidia.com>
> 
> If a VF has been configured and the user later clears all QoS settings,
> the vport element remains in the firmware QoS tree. This leads to
> inconsistent behavior compared to VFs that were never configured, since
> the FW assumes that unconfigured VFs are outside the QoS hierarchy.
> As a result, the bandwidth share across VFs may differ, even though
> none of them appear to have any configuration.
> 
> Align the driver behavior with the FW expectation by destroying the
> vport QoS element when all configurations are removed.
> 
> Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate")
> Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw")
> Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> ---
>   .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 57 ++++++++++++++++---
>   1 file changed, 49 insertions(+), 8 deletions(-)
this is much better, thank you!
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>

* Re: [PATCH V2 net 8/8] net/mlx5e: Preserve shared buffer capacity during headroom updates
  2025-08-20 13:32 ` [PATCH V2 net 8/8] net/mlx5e: Preserve shared buffer capacity during headroom updates Mark Bloch
@ 2025-08-20 14:50   ` Przemek Kitszel
  0 siblings, 0 replies; 13+ messages in thread
From: Przemek Kitszel @ 2025-08-20 14:50 UTC (permalink / raw)
  To: Armen Ratner, Mark Bloch
  Cc: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller, Leon Romanovsky, Saeed Mahameed,
	netdev, linux-rdma, linux-kernel, Gal Pressman, Maher Sanalla,
	Alexei Lazar, Moshe Shemesh

On 8/20/25 15:32, Mark Bloch wrote:
> From: Armen Ratner <armeng@nvidia.com>
> 
> When port buffer headroom changes, port_update_shared_buffer()
> recalculates the shared buffer size and splits it in a 3:1 ratio
> (lossy:lossless) - Currently, the calculation is:
> lossless = shared / 4;
> lossy = (shared / 4) * 3;
> 
> Meaning, the calculation dropped the remainder of shared % 4 due to
> integer division, unintentionally reducing the total shared buffer
> by up to three cells on each update. Over time, this could shrink
> the buffer below usable size.
> 
> Fix it by changing the calculation to:
> lossless = shared / 4;
> lossy = shared - lossless;
> 
> This retains all buffer cells while still approximating the
> intended 3:1 split, preventing capacity loss over time.

very nice fix and a good story :)
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>

> 
> While at it, perform headroom calculations in units of cells rather than
> in bytes for more accurate calculations avoiding extra divisions.

nit: next time I would split that into two commits

* Re: [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20
  2025-08-20 13:32 [PATCH V2 net 0/8] mlx5 misx fixes 2025-08-20 Mark Bloch
                   ` (7 preceding siblings ...)
  2025-08-20 13:32 ` [PATCH V2 net 8/8] net/mlx5e: Preserve shared buffer capacity during headroom updates Mark Bloch
@ 2025-08-21 15:20 ` patchwork-bot+netdevbpf
  8 siblings, 0 replies; 13+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-08-21 15:20 UTC (permalink / raw)
  To: Mark Bloch
  Cc: edumazet, kuba, pabeni, andrew+netdev, davem, przemyslaw.kitszel,
	tariqt, leon, saeedm, netdev, linux-rdma, linux-kernel, gal

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 20 Aug 2025 16:32:01 +0300 you wrote:
> Hi,
> 
> This patchset provides misc bug fixes from the team to the mlx5 core and
> Eth drivers.
> 
> v1: https://lore.kernel.org/all/1755095476-414026-1-git-send-email-tariqt@nvidia.com/
> 
> [...]

Here is the summary with links:
  - [V2,net,1/8] net/mlx5: Base ECVF devlink port attrs from 0
    https://git.kernel.org/netdev/net/c/bc17455bc843
  - [V2,net,2/8] net/mlx5: Remove default QoS group and attach vports directly to root TSAR
    https://git.kernel.org/netdev/net/c/330f0f6713a3
  - [V2,net,3/8] net/mlx5e: Preserve tc-bw during parent changes
    https://git.kernel.org/netdev/net/c/e8f973576ca5
  - [V2,net,4/8] net/mlx5: Destroy vport QoS element when no configuration remains
    https://git.kernel.org/netdev/net/c/b697ef4d1d13
  - [V2,net,5/8] net/mlx5: Fix QoS reference leak in vport enable error path
    https://git.kernel.org/netdev/net/c/3c114fb2afe4
  - [V2,net,6/8] net/mlx5: Restore missing scheduling node cleanup on vport enable failure
    https://git.kernel.org/netdev/net/c/51b17c98e3db
  - [V2,net,7/8] net/mlx5e: Query FW for buffer ownership
    https://git.kernel.org/netdev/net/c/451d2849ea66
  - [V2,net,8/8] net/mlx5e: Preserve shared buffer capacity during headroom updates
    https://git.kernel.org/netdev/net/c/8b0587a885fd

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



* Re: [PATCH V2 net 3/8] net/mlx5e: Preserve tc-bw during parent changes
  2025-08-20 13:32 ` [PATCH V2 net 3/8] net/mlx5e: Preserve tc-bw during parent changes Mark Bloch
@ 2025-10-02 16:31   ` ChaosEsque Team
  0 siblings, 0 replies; 13+ messages in thread
From: ChaosEsque Team @ 2025-10-02 16:31 UTC (permalink / raw)
  To: Mark Bloch
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, przemyslaw.kitszel, Tariq Toukan,
	Leon Romanovsky, Saeed Mahameed, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Carolina Jubran, Cosmin Ratiu

Mark Bloch.
Finally a normal name on this list.

(the rest is "Massive Dong Wang", and similar)

On Wed, Aug 20, 2025 at 10:03 AM Mark Bloch <mbloch@nvidia.com> wrote:
>
> From: Carolina Jubran <cjubran@nvidia.com>
>
> When changing parent of a node/leaf with tc-bw configured, the code
> saves and restores tc-bw values. However, it was reading the converted
> hardware bw_share values (where 0 becomes 1) instead of the original
> user values, causing incorrect tc-bw calculations after parent change.
>
> Store original tc-bw values in the node structure and use them directly
> for save/restore operations.
>
> Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw")
> Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 24 +++++++++----------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> index cd58d3934596..4ed5968f1638 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> @@ -102,6 +102,8 @@ struct mlx5_esw_sched_node {
>         u8 level;
>         /* Valid only when this node represents a traffic class. */
>         u8 tc;
> +       /* Valid only for a TC arbiter node or vport TC arbiter. */
> +       u32 tc_bw[DEVLINK_RATE_TCS_MAX];
>  };
>
>  static void esw_qos_node_attach_to_parent(struct mlx5_esw_sched_node *node)
> @@ -609,10 +611,7 @@ static void
>  esw_qos_tc_arbiter_get_bw_shares(struct mlx5_esw_sched_node *tc_arbiter_node,
>                                  u32 *tc_bw)
>  {
> -       struct mlx5_esw_sched_node *vports_tc_node;
> -
> -       list_for_each_entry(vports_tc_node, &tc_arbiter_node->children, entry)
> -               tc_bw[vports_tc_node->tc] = vports_tc_node->bw_share;
> +       memcpy(tc_bw, tc_arbiter_node->tc_bw, sizeof(tc_arbiter_node->tc_bw));
>  }
>
>  static void
> @@ -629,6 +628,7 @@ esw_qos_set_tc_arbiter_bw_shares(struct mlx5_esw_sched_node *tc_arbiter_node,
>                 u8 tc = vports_tc_node->tc;
>                 u32 bw_share;
>
> +               tc_arbiter_node->tc_bw[tc] = tc_bw[tc];
>                 bw_share = tc_bw[tc] * fw_max_bw_share;
>                 bw_share = esw_qos_calc_bw_share(bw_share, divider,
>                                                  fw_max_bw_share);
> @@ -1060,6 +1060,7 @@ static void esw_qos_vport_disable(struct mlx5_vport *vport, struct netlink_ext_a
>                 esw_qos_vport_tc_disable(vport, extack);
>
>         vport_node->bw_share = 0;
> +       memset(vport_node->tc_bw, 0, sizeof(vport_node->tc_bw));
>         list_del_init(&vport_node->entry);
>         esw_qos_normalize_min_rate(vport_node->esw, vport_node->parent, extack);
>
> @@ -1231,8 +1232,9 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
>                                 struct mlx5_esw_sched_node *parent,
>                                 struct netlink_ext_ack *extack)
>  {
> -       struct mlx5_esw_sched_node *curr_parent = vport->qos.sched_node->parent;
> -       enum sched_node_type curr_type = vport->qos.sched_node->type;
> +       struct mlx5_esw_sched_node *vport_node = vport->qos.sched_node;
> +       struct mlx5_esw_sched_node *curr_parent = vport_node->parent;
> +       enum sched_node_type curr_type = vport_node->type;
>         u32 curr_tc_bw[DEVLINK_RATE_TCS_MAX] = {0};
>         int err;
>
> @@ -1244,10 +1246,8 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
>         if (err)
>                 return err;
>
> -       if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type) {
> -               esw_qos_tc_arbiter_get_bw_shares(vport->qos.sched_node,
> -                                                curr_tc_bw);
> -       }
> +       if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type)
> +               esw_qos_tc_arbiter_get_bw_shares(vport_node, curr_tc_bw);
>
>         esw_qos_vport_disable(vport, extack);
>
> @@ -1258,8 +1258,8 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
>         }
>
>         if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type) {
> -               esw_qos_set_tc_arbiter_bw_shares(vport->qos.sched_node,
> -                                                curr_tc_bw, extack);
> +               esw_qos_set_tc_arbiter_bw_shares(vport_node, curr_tc_bw,
> +                                                extack);
>         }
>
>         return err;
> --
> 2.34.1
>
>
