* [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs
@ 2025-04-22 15:36 Michal Kubiak
2025-04-22 15:36 ` [PATCH iwl-net 1/3] ice: fix Tx scheduler error handling in XDP callback Michal Kubiak
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Michal Kubiak @ 2025-04-22 15:36 UTC (permalink / raw)
To: intel-wired-lan
Cc: maciej.fijalkowski, aleksander.lobakin, przemyslaw.kitszel,
dawid.osuchowski, jacob.e.keller, netdev, Michal Kubiak
Hi,
Some of our customers have reported a crash problem when trying to load
the XDP program on machines with a large number of CPU cores. After
extensive debugging, it became clear that the root cause of the problem
lies in the Tx scheduler implementation, which does not seem to be able
to handle the creation of a large number of Tx queues (even though this
number does not exceed the number of available queues reported by the
FW).
This series addresses this problem.
First of all, the XDP callback should not crash even if the Tx scheduler
returns an error, so Patch #1 fixes this error handling and makes the
XDP callback fail gracefully.
Patch #2 fixes the problem where the Tx scheduler tries to create too
many nodes even though some of them have already been added to the
scheduler tree.
Finally, Patch #3 improves the Tx scheduler tree rebuild algorithm to add
additional VSI support nodes when they are necessary to support all
requested Tx rings.
As testing hints, I include sample failure scenarios below:
1) Number of LAN Tx/Rx queue pairs: 128
Number of requested XDP queues: >= 321 and <= 640
Error message:
Failed to set LAN Tx queue context, error: -22
2) Number of LAN Tx/Rx queue pairs: 128
Number of requested XDP queues: >= 641
Error message:
Failed VSI LAN queue config for XDP, error: -5
Thanks,
Michal
Michal Kubiak (3):
ice: fix Tx scheduler error handling in XDP callback
ice: create new Tx scheduler nodes for new queues only
ice: fix rebuilding the Tx scheduler tree for large queue counts
drivers/net/ethernet/intel/ice/ice_main.c | 47 +++++++---
drivers/net/ethernet/intel/ice/ice_sched.c | 103 +++++++++++++++++++--
2 files changed, 126 insertions(+), 24 deletions(-)
--
2.45.2
* [PATCH iwl-net 1/3] ice: fix Tx scheduler error handling in XDP callback
2025-04-22 15:36 [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs Michal Kubiak
@ 2025-04-22 15:36 ` Michal Kubiak
2025-04-22 17:02 ` [Intel-wired-lan] " Loktionov, Aleksandr
2025-04-22 15:36 ` [PATCH iwl-net 2/3] ice: create new Tx scheduler nodes for new queues only Michal Kubiak
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Michal Kubiak @ 2025-04-22 15:36 UTC (permalink / raw)
To: intel-wired-lan
Cc: maciej.fijalkowski, aleksander.lobakin, przemyslaw.kitszel,
dawid.osuchowski, jacob.e.keller, netdev, Michal Kubiak
When the XDP program is loaded, the XDP callback adds new Tx queues.
This means that the callback must update the Tx scheduler with the new
queue number. In the event of a Tx scheduler failure, the XDP callback
should also fail and roll back any changes previously made for XDP
preparation.
The previous implementation had a bug in which not all changes made by the
XDP callback were rolled back. This caused a crash with the following
call trace:
[ +9.549584] ice 0000:ca:00.0: Failed VSI LAN queue config for XDP, error: -5
[ +0.382335] Oops: general protection fault, probably for non-canonical address 0x50a2250a90495525: 0000 [#1] SMP NOPTI
[ +0.010710] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.14.0-net-next-mar-31+ #14 PREEMPT(voluntary)
[ +0.010175] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022
[ +0.010946] RIP: 0010:__ice_update_sample+0x39/0xe0 [ice]
[...]
[ +0.002715] Call Trace:
[ +0.002452] <IRQ>
[ +0.002021] ? __die_body.cold+0x19/0x29
[ +0.003922] ? die_addr+0x3c/0x60
[ +0.003319] ? exc_general_protection+0x17c/0x400
[ +0.004707] ? asm_exc_general_protection+0x26/0x30
[ +0.004879] ? __ice_update_sample+0x39/0xe0 [ice]
[ +0.004835] ice_napi_poll+0x665/0x680 [ice]
[ +0.004320] __napi_poll+0x28/0x190
[ +0.003500] net_rx_action+0x198/0x360
[ +0.003752] ? update_rq_clock+0x39/0x220
[ +0.004013] handle_softirqs+0xf1/0x340
[ +0.003840] ? sched_clock_cpu+0xf/0x1f0
[ +0.003925] __irq_exit_rcu+0xc2/0xe0
[ +0.003665] common_interrupt+0x85/0xa0
[ +0.003839] </IRQ>
[ +0.002098] <TASK>
[ +0.002106] asm_common_interrupt+0x26/0x40
[ +0.004184] RIP: 0010:cpuidle_enter_state+0xd3/0x690
Fix this by performing the missing unmapping of XDP queues from
q_vectors and setting the XDP rings pointer back to NULL after all those
queues are released.
Also, add an immediate exit from the XDP callback in case of ring
preparation failure.
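For illustration, the fix follows the usual reverse-order unwind pattern.
The toy program below is a sketch of that pattern only (made-up names,
not driver code); the actual changes are in the diff that follows.

/* Toy illustration of the unwind pattern: a scheduler failure must undo
 * every step done so far, in reverse order of setup.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_vsi {
	int *rings;		/* stands in for vsi->xdp_rings */
	bool mapped;		/* stands in for the q_vector mapping */
};

static int toy_config_sched(bool fail)
{
	return fail ? -5 : 0;	/* -EIO, as in the log above */
}

static int toy_prepare_xdp(struct toy_vsi *vsi, bool sched_fails)
{
	int err;

	vsi->rings = calloc(4, sizeof(*vsi->rings));
	if (!vsi->rings)
		return -12;	/* -ENOMEM: nothing to undo yet */

	vsi->mapped = true;	/* map the new rings to vectors */

	err = toy_config_sched(sched_fails);
	if (err)
		goto unmap_rings;

	return 0;

unmap_rings:
	vsi->mapped = false;	/* the rollback step that was missing */
	free(vsi->rings);
	vsi->rings = NULL;	/* don't leave a stale pointer behind */
	return err;
}

int main(void)
{
	struct toy_vsi vsi = { 0 };

	printf("failure path: %d (rings=%p, mapped=%d)\n",
	       toy_prepare_xdp(&vsi, true), (void *)vsi.rings, vsi.mapped);
	return 0;
}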
Fixes: efc2214b6047 ("ice: Add support for XDP")
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
---
drivers/net/ethernet/intel/ice/ice_main.c | 47 ++++++++++++++++-------
1 file changed, 33 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 8119913b5f69..34df104ac567 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2740,6 +2740,27 @@ void ice_map_xdp_rings(struct ice_vsi *vsi)
}
}
+/**
+ * ice_unmap_xdp_rings - Unmap XDP rings from interrupt vectors
+ * @vsi: the VSI with XDP rings being unmapped
+ */
+static void ice_unmap_xdp_rings(struct ice_vsi *vsi)
+{
+ int v_idx;
+
+ ice_for_each_q_vector(vsi, v_idx) {
+ struct ice_q_vector *q_vector = vsi->q_vectors[v_idx];
+ struct ice_tx_ring *ring;
+
+ ice_for_each_tx_ring(ring, q_vector->tx)
+ if (!ring->tx_buf || !ice_ring_is_xdp(ring))
+ break;
+
+ /* restore the value of last node prior to XDP setup */
+ q_vector->tx.tx_ring = ring;
+ }
+}
+
/**
* ice_prepare_xdp_rings - Allocate, configure and setup Tx rings for XDP
* @vsi: VSI to bring up Tx rings used by XDP
@@ -2803,7 +2824,7 @@ int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog,
if (status) {
dev_err(dev, "Failed VSI LAN queue config for XDP, error: %d\n",
status);
- goto clear_xdp_rings;
+ goto unmap_xdp_rings;
}
/* assign the prog only when it's not already present on VSI;
@@ -2819,6 +2840,8 @@ int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog,
ice_vsi_assign_bpf_prog(vsi, prog);
return 0;
+unmap_xdp_rings:
+ ice_unmap_xdp_rings(vsi);
clear_xdp_rings:
ice_for_each_xdp_txq(vsi, i)
if (vsi->xdp_rings[i]) {
@@ -2835,6 +2858,8 @@ int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog,
mutex_unlock(&pf->avail_q_mutex);
devm_kfree(dev, vsi->xdp_rings);
+ vsi->xdp_rings = NULL;
+
return -ENOMEM;
}
@@ -2850,7 +2875,7 @@ int ice_destroy_xdp_rings(struct ice_vsi *vsi, enum ice_xdp_cfg cfg_type)
{
u16 max_txqs[ICE_MAX_TRAFFIC_CLASS] = { 0 };
struct ice_pf *pf = vsi->back;
- int i, v_idx;
+ int i;
/* q_vectors are freed in reset path so there's no point in detaching
* rings
@@ -2858,17 +2883,7 @@ int ice_destroy_xdp_rings(struct ice_vsi *vsi, enum ice_xdp_cfg cfg_type)
if (cfg_type == ICE_XDP_CFG_PART)
goto free_qmap;
- ice_for_each_q_vector(vsi, v_idx) {
- struct ice_q_vector *q_vector = vsi->q_vectors[v_idx];
- struct ice_tx_ring *ring;
-
- ice_for_each_tx_ring(ring, q_vector->tx)
- if (!ring->tx_buf || !ice_ring_is_xdp(ring))
- break;
-
- /* restore the value of last node prior to XDP setup */
- q_vector->tx.tx_ring = ring;
- }
+ ice_unmap_xdp_rings(vsi);
free_qmap:
mutex_lock(&pf->avail_q_mutex);
@@ -3013,11 +3028,14 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
xdp_ring_err = ice_vsi_determine_xdp_res(vsi);
if (xdp_ring_err) {
NL_SET_ERR_MSG_MOD(extack, "Not enough Tx resources for XDP");
+ goto resume_if;
} else {
xdp_ring_err = ice_prepare_xdp_rings(vsi, prog,
ICE_XDP_CFG_FULL);
- if (xdp_ring_err)
+ if (xdp_ring_err) {
NL_SET_ERR_MSG_MOD(extack, "Setting up XDP Tx resources failed");
+ goto resume_if;
+ }
}
xdp_features_set_redirect_target(vsi->netdev, true);
/* reallocate Rx queues that are used for zero-copy */
@@ -3035,6 +3053,7 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
NL_SET_ERR_MSG_MOD(extack, "Freeing XDP Rx resources failed");
}
+resume_if:
if (if_running)
ret = ice_up(vsi);
--
2.45.2
* [PATCH iwl-net 2/3] ice: create new Tx scheduler nodes for new queues only
2025-04-22 15:36 [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs Michal Kubiak
2025-04-22 15:36 ` [PATCH iwl-net 1/3] ice: fix Tx scheduler error handling in XDP callback Michal Kubiak
@ 2025-04-22 15:36 ` Michal Kubiak
2025-04-22 15:36 ` [PATCH iwl-net 3/3] ice: fix rebuilding the Tx scheduler tree for large queue counts Michal Kubiak
2025-05-07 5:31 ` [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs Jesse Brandeburg
3 siblings, 0 replies; 10+ messages in thread
From: Michal Kubiak @ 2025-04-22 15:36 UTC (permalink / raw)
To: intel-wired-lan
Cc: maciej.fijalkowski, aleksander.lobakin, przemyslaw.kitszel,
dawid.osuchowski, jacob.e.keller, netdev, Michal Kubiak
The current implementation of the Tx scheduler tree attempts
to create nodes for all Tx queues, ignoring the fact that some
queues may already exist in the tree. For example, if the VSI
already has 128 Tx queues and the user requests 16 new queues,
the Tx scheduler will compute the tree for 272 queues (128 existing
queues + 144 new queues), instead of 144 queues (128 existing queues
and 16 new queues).
Fix that by modifying the node count calculation algorithm to skip
the queues that already exist in the tree.
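To illustrate the arithmetic with the example above, here is a standalone
sketch (not driver code):

/* Standalone sketch of the node-count arithmetic described above. */
#include <stdio.h>

int main(void)
{
	unsigned int prev_numqs = 128;	/* queues already in the tree */
	unsigned int req_new = 16;	/* queues requested by the user */
	unsigned int new_numqs = prev_numqs + req_new;	/* 144 in total */

	/* before the fix: child nodes are computed for all 144 queues as
	 * if none of them existed, so the tree ends up sized for
	 * 128 + 144 = 272 queues
	 */
	printf("old: nodes for %u queues, tree sized for %u\n",
	       new_numqs, prev_numqs + new_numqs);

	/* after the fix: only the 16 genuinely new queues are counted */
	printf("new: nodes for %u queues, tree sized for %u\n",
	       new_numqs - prev_numqs, new_numqs);

	return 0;
}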
Fixes: 5513b920a4f7 ("ice: Update Tx scheduler tree for VSI multi-Tx queue support")
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
---
drivers/net/ethernet/intel/ice/ice_sched.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c
index 6ca13c5dcb14..6524875b34d3 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.c
+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
@@ -1604,16 +1604,16 @@ ice_sched_get_agg_node(struct ice_port_info *pi, struct ice_sched_node *tc_node,
/**
* ice_sched_calc_vsi_child_nodes - calculate number of VSI child nodes
* @hw: pointer to the HW struct
- * @num_qs: number of queues
+ * @num_new_qs: number of new queues that will be added to the tree
* @num_nodes: num nodes array
*
* This function calculates the number of VSI child nodes based on the
* number of queues.
*/
static void
-ice_sched_calc_vsi_child_nodes(struct ice_hw *hw, u16 num_qs, u16 *num_nodes)
+ice_sched_calc_vsi_child_nodes(struct ice_hw *hw, u16 num_new_qs, u16 *num_nodes)
{
- u16 num = num_qs;
+ u16 num = num_new_qs;
u8 i, qgl, vsil;
qgl = ice_sched_get_qgrp_layer(hw);
@@ -1863,8 +1863,9 @@ ice_sched_update_vsi_child_nodes(struct ice_port_info *pi, u16 vsi_handle,
return status;
}
- if (new_numqs)
- ice_sched_calc_vsi_child_nodes(hw, new_numqs, new_num_nodes);
+ ice_sched_calc_vsi_child_nodes(hw, new_numqs - prev_numqs,
+ new_num_nodes);
+
/* Keep the max number of queue configuration all the time. Update the
* tree only if number of queues > previous number of queues. This may
* leave some extra nodes in the tree if number of queues < previous
--
2.45.2
* [PATCH iwl-net 3/3] ice: fix rebuilding the Tx scheduler tree for large queue counts
2025-04-22 15:36 [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs Michal Kubiak
2025-04-22 15:36 ` [PATCH iwl-net 1/3] ice: fix Tx scheduler error handling in XDP callback Michal Kubiak
2025-04-22 15:36 ` [PATCH iwl-net 2/3] ice: create new Tx scheduler nodes for new queues only Michal Kubiak
@ 2025-04-22 15:36 ` Michal Kubiak
2025-05-07 5:31 ` [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs Jesse Brandeburg
3 siblings, 0 replies; 10+ messages in thread
From: Michal Kubiak @ 2025-04-22 15:36 UTC (permalink / raw)
To: intel-wired-lan
Cc: maciej.fijalkowski, aleksander.lobakin, przemyslaw.kitszel,
dawid.osuchowski, jacob.e.keller, netdev, Michal Kubiak
The current implementation of the Tx scheduler allows the tree to be
rebuilt as the user adds more Tx queues to the VSI. In such a case,
additional child nodes are added to the tree to support the new number
of queues.
Unfortunately, this algorithm does not take into account that the limit
of the VSI support node may be exceeded, so an additional node in the
VSI layer may be required to handle all the requested queues.
Such a scenario occurs when adding XDP Tx queues on machines with many
CPUs (e.g. at least 321 CPUs, if there are already 128 Tx/Rx queue pairs).
Although the driver still respects the queue limit returned by the FW,
the Tx scheduler was unable to add those queues to its tree and returned
the following errors:
Failed VSI LAN queue config for XDP, error: -5
or:
Failed to set LAN Tx queue context, error: -22
Fix this problem by extending the tree rebuild algorithm to check if the
current VSI node can support the requested number of queues. If it
cannot, create as many additional VSI support nodes as necessary to
handle all the required Tx queues.
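A standalone sketch of that calculation follows (not driver code). The
512-queue capacity per VSI support node is an example value only; the
real capacity is the product of the max_children limits from the VSI
layer down to the queue group layer, as computed by
ice_sched_recalc_vsi_support_nodes() in the diff below.

/* Standalone sketch: how many extra VSI support nodes are needed for a
 * given total queue count. The 512-queue per-node capacity is only an
 * example value.
 */
#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

static unsigned int extra_vsi_nodes(unsigned int new_numqs,
				    unsigned int cur_vsi_nodes,
				    unsigned int max_queue_cnt)
{
	if (new_numqs <= max_queue_cnt * cur_vsi_nodes)
		return 0;	/* existing VSI support nodes are enough */

	return DIV_ROUND_UP(new_numqs, max_queue_cnt) - cur_vsi_nodes;
}

int main(void)
{
	/* one 512-queue VSI support node cannot hold 640 queues: add 1 */
	printf("%u\n", extra_vsi_nodes(640, 1, 512));
	/* 1025 queues need two more nodes on top of the existing one */
	printf("%u\n", extra_vsi_nodes(1025, 1, 512));
	return 0;
}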
Also, make the search for the next free VSI node more restrictive. That is,
add queue group nodes only to the VSI support nodes that have a matching
VSI handle.
Finally, fix the comment describing the tree update algorithm to better
reflect the current scenario.
Fixes: b0153fdd7e8a ("ice: update VSI config dynamically")
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
---
drivers/net/ethernet/intel/ice/ice_sched.c | 92 ++++++++++++++++++++--
1 file changed, 87 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c
index 6524875b34d3..9d1f8db4c503 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.c
+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
@@ -84,6 +84,27 @@ ice_sched_find_node_by_teid(struct ice_sched_node *start_node, u32 teid)
return NULL;
}
+/**
+ * ice_sched_find_next_vsi_node - find the next node for a given VSI
+ * @vsi_node: VSI support node to start search with
+ *
+ * Return: Next VSI support node, or NULL.
+ *
+ * The function returns a pointer to the next node from the VSI layer
+ * assigned to the given VSI, or NULL if there is no such node.
+ */
+static struct ice_sched_node *
+ice_sched_find_next_vsi_node(struct ice_sched_node *vsi_node)
+{
+ unsigned int vsi_handle = vsi_node->vsi_handle;
+
+ while ((vsi_node = vsi_node->sibling) != NULL)
+ if (vsi_node->vsi_handle == vsi_handle)
+ break;
+
+ return vsi_node;
+}
+
/**
* ice_aqc_send_sched_elem_cmd - send scheduling elements cmd
* @hw: pointer to the HW struct
@@ -1084,8 +1105,10 @@ ice_sched_add_nodes_to_layer(struct ice_port_info *pi,
if (parent->num_children < max_child_nodes) {
new_num_nodes = max_child_nodes - parent->num_children;
} else {
- /* This parent is full, try the next sibling */
- parent = parent->sibling;
+ /* This parent is full,
+ * try the next available sibling.
+ */
+ parent = ice_sched_find_next_vsi_node(parent);
/* Don't modify the first node TEID memory if the
* first node was added already in the above call.
* Instead send some temp memory for all other
@@ -1528,12 +1551,23 @@ ice_sched_get_free_qparent(struct ice_port_info *pi, u16 vsi_handle, u8 tc,
/* get the first queue group node from VSI sub-tree */
qgrp_node = ice_sched_get_first_node(pi, vsi_node, qgrp_layer);
while (qgrp_node) {
+ struct ice_sched_node *next_vsi_node;
+
/* make sure the qgroup node is part of the VSI subtree */
if (ice_sched_find_node_in_subtree(pi->hw, vsi_node, qgrp_node))
if (qgrp_node->num_children < max_children &&
qgrp_node->owner == owner)
break;
qgrp_node = qgrp_node->sibling;
+ if (qgrp_node)
+ continue;
+
+ next_vsi_node = ice_sched_find_next_vsi_node(vsi_node);
+ if (!next_vsi_node)
+ break;
+
+ vsi_node = next_vsi_node;
+ qgrp_node = ice_sched_get_first_node(pi, vsi_node, qgrp_layer);
}
/* Select the best queue group */
@@ -1779,7 +1813,11 @@ ice_sched_add_vsi_support_nodes(struct ice_port_info *pi, u16 vsi_handle,
if (!parent)
return -EIO;
- if (i == vsil)
+ /* Do not modify the VSI handle for already existing VSI nodes,
+ * (if no new VSI node was added to the tree).
+ * Assign the VSI handle only to newly added VSI nodes.
+ */
+ if (i == vsil && num_added)
parent->vsi_handle = vsi_handle;
}
@@ -1812,6 +1850,41 @@ ice_sched_add_vsi_to_topo(struct ice_port_info *pi, u16 vsi_handle, u8 tc)
num_nodes);
}
+/**
+ * ice_sched_recalc_vsi_support_nodes - recalculate VSI support nodes count
+ * @hw: pointer to the HW struct
+ * @vsi_node: pointer to the leftmost VSI node that needs to be extended
+ * @new_numqs: new number of queues that has to be handled by the VSI
+ * @new_num_nodes: pointer to nodes count table to modify the VSI layer entry
+ *
+ * This function recalculates the number of supported nodes that need to
+ * be added after adding more Tx queues for a given VSI.
+ * The number of new VSI support nodes that shall be added will be saved
+ * to the @new_num_nodes table for the VSI layer.
+ */
+static void
+ice_sched_recalc_vsi_support_nodes(struct ice_hw *hw,
+ struct ice_sched_node *vsi_node,
+ unsigned int new_numqs, u16 *new_num_nodes)
+{
+ u32 vsi_nodes_cnt = 1;
+ u32 max_queue_cnt = 1;
+ u32 qgl, vsil;
+
+ qgl = ice_sched_get_qgrp_layer(hw);
+ vsil = ice_sched_get_vsi_layer(hw);
+
+ for (u32 i = vsil; i <= qgl; i++)
+ max_queue_cnt *= hw->max_children[i];
+
+ while ((vsi_node = ice_sched_find_next_vsi_node(vsi_node)) != NULL)
+ vsi_nodes_cnt++;
+
+ if (new_numqs > (max_queue_cnt * vsi_nodes_cnt))
+ new_num_nodes[vsil] = DIV_ROUND_UP(new_numqs, max_queue_cnt) -
+ vsi_nodes_cnt;
+}
+
/**
* ice_sched_update_vsi_child_nodes - update VSI child nodes
* @pi: port information structure
@@ -1863,16 +1936,25 @@ ice_sched_update_vsi_child_nodes(struct ice_port_info *pi, u16 vsi_handle,
return status;
}
+ ice_sched_recalc_vsi_support_nodes(hw, vsi_node,
+ new_numqs, new_num_nodes);
ice_sched_calc_vsi_child_nodes(hw, new_numqs - prev_numqs,
new_num_nodes);
- /* Keep the max number of queue configuration all the time. Update the
- * tree only if number of queues > previous number of queues. This may
+ /* Never decrease the number of queues in the tree. Update the tree
+ * only if number of queues > previous number of queues. This may
* leave some extra nodes in the tree if number of queues < previous
* number but that wouldn't harm anything. Removing those extra nodes
* may complicate the code if those nodes are part of SRL or
* individually rate limited.
+ * Also, add the required VSI support nodes if the existing ones cannot
+ * handle the requested new number of queues.
*/
+ status = ice_sched_add_vsi_support_nodes(pi, vsi_handle, tc_node,
+ new_num_nodes);
+ if (status)
+ return status;
+
status = ice_sched_add_vsi_child_nodes(pi, vsi_handle, tc_node,
new_num_nodes, owner);
if (status)
--
2.45.2
* RE: [Intel-wired-lan] [PATCH iwl-net 1/3] ice: fix Tx scheduler error handling in XDP callback
2025-04-22 15:36 ` [PATCH iwl-net 1/3] ice: fix Tx scheduler error handling in XDP callback Michal Kubiak
@ 2025-04-22 17:02 ` Loktionov, Aleksandr
0 siblings, 0 replies; 10+ messages in thread
From: Loktionov, Aleksandr @ 2025-04-22 17:02 UTC (permalink / raw)
To: Kubiak, Michal, intel-wired-lan@lists.osuosl.org
Cc: Fijalkowski, Maciej, Lobakin, Aleksander, Kitszel, Przemyslaw,
dawid.osuchowski@linux.intel.com, Keller, Jacob E,
netdev@vger.kernel.org, Kubiak, Michal
> [...]
> Fixes: efc2214b6047 ("ice: Add support for XDP")
> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
> [...]
* Re: [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs
2025-04-22 15:36 [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs Michal Kubiak
` (2 preceding siblings ...)
2025-04-22 15:36 ` [PATCH iwl-net 3/3] ice: fix rebuilding the Tx scheduler tree for large queue counts Michal Kubiak
@ 2025-05-07 5:31 ` Jesse Brandeburg
2025-05-07 8:00 ` Michal Kubiak
3 siblings, 1 reply; 10+ messages in thread
From: Jesse Brandeburg @ 2025-05-07 5:31 UTC (permalink / raw)
To: Michal Kubiak, intel-wired-lan
Cc: maciej.fijalkowski, aleksander.lobakin, przemyslaw.kitszel,
dawid.osuchowski, jacob.e.keller, netdev, kernel-team
On 4/22/25 8:36 AM, Michal Kubiak wrote:
> Hi,
>
> Some of our customers have reported a crash problem when trying to load
> the XDP program on machines with a large number of CPU cores. After
> extensive debugging, it became clear that the root cause of the problem
> lies in the Tx scheduler implementation, which does not seem to be able
> to handle the creation of a large number of Tx queues (even though this
> number does not exceed the number of available queues reported by the
> FW).
> This series addresses this problem.
Hi Michal,
Unfortunately this version of the series seems to reintroduce the
original problem (error -22).
I double-checked the patches; they looked like they were applied in our
test version 2025.5.8 build, which contained a 6.12.26 kernel with this
series applied (all 3).
Our setup reports a max of 252 combined queues but is running 384 CPUs;
by default it loads an XDP program, then reduces the number of queues to
192 using ethtool. After that we get error -22 and the link is down.
Sorry to bring some bad news, and I know it took a while; it is a bit of
a process to test this in our lab.
The original version you had sent us worked fine when we tested it, so
the problem seems to have been introduced between those two versions. I
suppose something could have gone wrong with the source code, but I
sincerely doubt it, as I used git and the patches applied cleanly.
We are only able to test 6.12.y or 6.6.y stable variants of the kernel
if you want to make a test version of a fixed series for us to try.
Thanks,
Jesse
some dmesg follows:
sudo dmesg | grep -E "ice 0000:c1:00.0|ice:"
[ 20.932638] ice: Intel(R) Ethernet Connection E800 Series Linux Driver
[ 20.932642] ice: Copyright (c) 2018, Intel Corporation.
[ 21.259332] ice 0000:c1:00.0: DDP package does not support Tx scheduling layers switching feature - please update to the latest DDP package and try again
[ 21.552597] ice 0000:c1:00.0: The DDP package was successfully loaded: ICE COMMS Package version 1.3.51.0
[ 21.610275] ice 0000:c1:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 21.623960] ice 0000:c1:00.0: RDMA is not supported on this device
[ 21.672421] ice 0000:c1:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
[ 21.705729] ice 0000:c1:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
[ 21.722873] ice 0000:c1:00.0: Commit DCB Configuration to the hardware
[ 22.086346] ice 0000:c1:00.1: DDP package already present on device: ICE COMMS Package version 1.3.51.0
[ 22.289956] ice 0000:c1:00.0 ext0: renamed from eth0
[ 23.137538] ice 0000:c1:00.0 ext0: NIC Link is up 25 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: NONE, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: None
[ 499.643936] ice 0000:c1:00.0: Failed to set LAN Tx queue context, error: -22
* Re: [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs
2025-05-07 5:31 ` [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs Jesse Brandeburg
@ 2025-05-07 8:00 ` Michal Kubiak
2025-05-08 5:51 ` Jesse Brandeburg
0 siblings, 1 reply; 10+ messages in thread
From: Michal Kubiak @ 2025-05-07 8:00 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: intel-wired-lan, maciej.fijalkowski, aleksander.lobakin,
przemyslaw.kitszel, dawid.osuchowski, jacob.e.keller, netdev,
kernel-team
On Tue, May 06, 2025 at 10:31:59PM -0700, Jesse Brandeburg wrote:
> On 4/22/25 8:36 AM, Michal Kubiak wrote:
> > Hi,
> >
> > Some of our customers have reported a crash problem when trying to load
> > the XDP program on machines with a large number of CPU cores. After
> > extensive debugging, it became clear that the root cause of the problem
> > lies in the Tx scheduler implementation, which does not seem to be able
> > to handle the creation of a large number of Tx queues (even though this
> > number does not exceed the number of available queues reported by the
> > FW).
> > This series addresses this problem.
>
>
> Hi Michal,
>
> Unfortunately this version of the series seems to reintroduce the original
> problem error: -22.
Hi Jesse,
Thanks for testing and reporting!
I will take a look at the problem and try to reproduce it locally. I also
have a few questions inline.
First, wasn't your original problem the failure with error -5? Or did you see
both (-5 and -22), depending on the scenario/environment?
I am asking because these two errors occurred at different initialization
stages of the Tx scheduler. Of course, the series was intended to address
both of these issues.
>
> I double checked the patches, they looked like they were applied in our test
> version 2025.5.8 build which contained a 6.12.26 kernel with this series
> applied (all 3)
>
> Our setup is saying max 252 combined queues, but running 384 CPUs by
> default, loads an XDP program, then reduces the number of queues using
> ethtool, to 192. After that we get the error -22 and link is down.
>
To be honest, I did not test the scenario in which the number of queues is
reduced while the XDP program is running. This is the first thing I will check.
Can you please confirm that you did that step on both the current
and the draft version of the series?
It would also be interesting to check what happens if the queue number is reduced
before loading the XDP program.
> Sorry to bring some bad news, and I know it took a while, it is a bit of a
> process to test this in our lab.
>
> The original version you had sent us was working fine when we tested it, so
> the problem seems to be between those two versions. I suppose it could be
> possible (but unlikely because I used git to apply the patches) that there
> was something wrong with the source code, but I sincerely doubt it as the
> patches had applied cleanly.
Compared to the draft, the current series contains mostly cosmetic fixes,
but a regression is still possible, so I will take a look at the diff.
>
> We are only able to test 6.12.y or 6.6.y stable variants of the kernel if
> you want to make a test version of a fixed series for us to try.
>
> Thanks,
>
> Jesse
>
I will keep you updated on my findings.
Thanks,
Michal
* Re: [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs
2025-05-07 8:00 ` Michal Kubiak
@ 2025-05-08 5:51 ` Jesse Brandeburg
2025-05-08 14:29 ` Michal Kubiak
0 siblings, 1 reply; 10+ messages in thread
From: Jesse Brandeburg @ 2025-05-08 5:51 UTC (permalink / raw)
To: Michal Kubiak
Cc: intel-wired-lan, maciej.fijalkowski, aleksander.lobakin,
przemyslaw.kitszel, dawid.osuchowski, jacob.e.keller, netdev,
kernel-team
On 5/7/25 1:00 AM, Michal Kubiak wrote:
> On Tue, May 06, 2025 at 10:31:59PM -0700, Jesse Brandeburg wrote:
>> On 4/22/25 8:36 AM, Michal Kubiak wrote:
>>> Hi,
>>>
>>> Some of our customers have reported a crash problem when trying to load
>>> the XDP program on machines with a large number of CPU cores. After
>>> extensive debugging, it became clear that the root cause of the problem
>>> lies in the Tx scheduler implementation, which does not seem to be able
>>> to handle the creation of a large number of Tx queues (even though this
>>> number does not exceed the number of available queues reported by the
>>> FW).
>>> This series addresses this problem.
>>
>> Hi Michal,
>>
>> Unfortunately this version of the series seems to reintroduce the original
>> problem error: -22.
> Hi Jesse,
>
> Thanks for testing and reporting!
>
> I will take a look at the problem and try to reproduce it locally. I would also
> have a few questions inline.
>
> First, was your original problem not the failure with error: -5? Or did you have
> both (-5 and -22), depending on the scenario/environment?
> I am asking because it seems that these two errors occurred at different
> initialization stages of the tx scheduler. Of course, the series
> was intended to address both of these issues.
We had a few issues to work through; I believe the original problem we
had was -22 (just confirmed), with more than 320 CPUs.
>> I double checked the patches, they looked like they were applied in our test
>> version 2025.5.8 build which contained a 6.12.26 kernel with this series
>> applied (all 3)
>>
>> Our setup is saying max 252 combined queues, but running 384 CPUs by
>> default, loads an XDP program, then reduces the number of queues using
>> ethtool, to 192. After that we get the error -22 and link is down.
>>
> To be honest, I did not test the scenario in which the number of queues is
> reduced while the XDP program is running. This is the first thing I will check.
Cool, I hope it will help your repro, but see below.
> Can you please confirm that you did that step on both the current
> and the draft version of the series?
> It would also be interesting to check what happens if the queue number is reduced
> before loading the XDP program.
We noticed we had a difference in the testing of the draft and the current
series. We have a patch against the kernel that was helping us work around
this issue, which looked like this:
diff --git a/drivers/net/ethernet/intel/ice/ice_irq.c b/drivers/net/ethernet/intel/ice/ice_irq.c
index ad82ff7d1995..622d409efbce 100644
--- a/drivers/net/ethernet/intel/ice/ice_irq.c
+++ b/drivers/net/ethernet/intel/ice/ice_irq.c
@@ -126,6 +126,10 @@ static void ice_reduce_msix_usage(struct ice_pf *pf, int v_remain)
}
}
+static int num_msix = -1;
+module_param(num_msix, int, 0644);
+MODULE_PARM_DESC(num_msix, "Default limit of MSI-X vectors for LAN");
+
/**
* ice_ena_msix_range - Request a range of MSIX vectors from the OS
* @pf: board private structure
@@ -156,7 +160,16 @@ static int ice_ena_msix_range(struct ice_pf *pf)
v_wanted = v_other;
/* LAN traffic */
- pf->num_lan_msix = num_cpus;
+ /* Cloudflare: allocate the msix vector count based on module param
+ * num_msix. Alternately, default to half the number of CPUs or 128,
+ * whichever is smallest, and should the number of CPUs be 2, 1, or
+ * 0, then default to 2 vectors
+ */
+ if (num_msix != -1)
+ pf->num_lan_msix = num_msix;
+ else
+ pf->num_lan_msix = min_t(u16, (num_cpus / 2) ?: 2, 128);
+
v_wanted += pf->num_lan_msix;
/* RDMA auxiliary driver */
The module parameter helped us limit the number of vectors, which
allowed our machines to finish booting before your new patches were
available.
The failure with the new patches occurred when this value was set to 252,
and the "draft" patches also fail in this configuration (this is new info
from today).
>> The original version you had sent us was working fine when we tested it, so
>> the problem seems to be between those two versions. I suppose it could be
So the problem is also related to the initial number of queues the driver
starts with.
>> possible (but unlikely because I used git to apply the patches) that there
>> was something wrong with the source code, but I sincerely doubt it as the
>> patches had applied cleanly.
The reason it worked fine was that we tested the "draft" (and now the new
patches too) with the module parameter set to 384 queues (with 384 CPUs),
or let it default to 128 queues; both worked with the new and old series.
252 seems to be some magic failure-causing number with both patch sets;
we don't know why.
Thanks for your patience while we worked through the testing differences
here today. Hope this helps and let me know if you have more questions.
Jesse
* Re: [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs
2025-05-08 5:51 ` Jesse Brandeburg
@ 2025-05-08 14:29 ` Michal Kubiak
2025-05-09 10:07 ` Michal Kubiak
0 siblings, 1 reply; 10+ messages in thread
From: Michal Kubiak @ 2025-05-08 14:29 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: intel-wired-lan, maciej.fijalkowski, aleksander.lobakin,
przemyslaw.kitszel, dawid.osuchowski, jacob.e.keller, netdev,
kernel-team
On Wed, May 07, 2025 at 10:51:47PM -0700, Jesse Brandeburg wrote:
> On 5/7/25 1:00 AM, Michal Kubiak wrote:
> > On Tue, May 06, 2025 at 10:31:59PM -0700, Jesse Brandeburg wrote:
> > > On 4/22/25 8:36 AM, Michal Kubiak wrote:
> > > > Hi,
> > > >
> > > > Some of our customers have reported a crash problem when trying to load
> > > > the XDP program on machines with a large number of CPU cores. After
> > > > extensive debugging, it became clear that the root cause of the problem
> > > > lies in the Tx scheduler implementation, which does not seem to be able
> > > > to handle the creation of a large number of Tx queues (even though this
> > > > number does not exceed the number of available queues reported by the
> > > > FW).
> > > > This series addresses this problem.
> > >
> > > Hi Michal,
> > >
> > > Unfortunately this version of the series seems to reintroduce the original
> > > problem error: -22.
> > Hi Jesse,
> >
> > Thanks for testing and reporting!
> >
> > I will take a look at the problem and try to reproduce it locally. I would also
> > have a few questions inline.
> >
> > First, was your original problem not the failure with error: -5? Or did you have
> > both (-5 and -22), depending on the scenario/environment?
> > I am asking because it seems that these two errors occurred at different
> > initialization stages of the tx scheduler. Of course, the series
> > was intended to address both of these issues.
>
>
> We had a few issues to work through, I believe the original problem we had
> was -22 (just confirmed) with more than 320 CPUs.
>
OK. In fact, there were a few problems in the Tx scheduler implementation,
and the error code depended on the queue number: a different part of the
scheduler code became the bottleneck in each case, and the error could be
returned from different functions.
> > > I double checked the patches, they looked like they were applied in our test
> > > version 2025.5.8 build which contained a 6.12.26 kernel with this series
> > > applied (all 3)
> > >
> > > Our setup is saying max 252 combined queues, but running 384 CPUs by
> > > default, loads an XDP program, then reduces the number of queues using
> > > ethtool, to 192. After that we get the error -22 and link is down.
> > >
> > To be honest, I did not test the scenario in which the number of queues is
> > reduced while the XDP program is running. This is the first thing I will check.
>
> Cool, I hope it will help your repro, but see below.
>
I can now confirm that this is the problematic scenario. I have successfully
reproduced it locally with both the draft and current versions of my series.
Also, if I reverse the order of the calls (change the queue number first,
and then load the XDP program), I do not have a problem.
The good news is that I seem to have already found the root cause of the problem
and have a draft fix that appears to work.
During debugging, I realized that the flow for rebuilding the Tx scheduler
tree on an `ethtool -L` call is different from the flow for adding XDP Tx
queues (when the program is loaded).
When the `ethtool -L` command is called, the VSI is rebuilt including
the Tx scheduler. This means that all the scheduler nodes associated with the VSI
are removed and added again.
The point of my series was to create a way to add additional "VSI support nodes"
(if needed) to handle the high number of Tx queues. Although I modified
the algorithm for adding nodes, I did not touch the function for removing them.
As a result, some extra "VSI support nodes" were still present in the tree when
the VSI was rebuilt, so there was no room to add those nodes again after
the VSI was restarted with a different queue number.
> > Can you please confirm that you did that step on both the current
> > and the draft version of the series?
> > It would also be interesting to check what happens if the queue number is reduced
> > before loading the XDP program.
>
> We noticed we had a difference in the testing of draft and current. We have
> a patch against the kernel that was helping us work around this issue, which
> looked like this:
>
> [...]
>
> The module parameter helped us limit the number of vectors, which allowed
> our machines to finish booting before your new patches were available.
>
> The failure of the new patch was when this value was set to 252, and the
> "draft" patch also fails in this configuration (this is new info from today)
>
>
> > > The original version you had sent us was working fine when we tested it, so
> > > the problem seems to be between those two versions. I suppose it could be
> So the problem is also related to the inital number of queues the driver
> starts with. When we
> > > possible (but unlikely because I used git to apply the patches) that there
> > > was something wrong with the source code, but I sincerely doubt it as the
> > > patches had applied cleanly.
> The reason it worked fine was we tested "draft" (and now the new patches
> too) with the module parameter set to 384 queues (with 384 CPUs), or letting
> it default to 128 queues, both worked with the new and old series. 252 seems
> to be some magic failure causing number with both patches, we don't know
> why.
>
During my work on the series, I had a similar observation about some "magic"
limits that changed the behavior (e.g. the error code).
I think it is because increasing the queue count can cause a step change in
the capacity of the Tx scheduler tree.
For example, if the current VSI nodes in the tree are already full, adding just
one more Tx queue triggers the insertion of another "VSI support node" that can
handle 512 more Tx queues.
I guess that with 384 queues you could have more (almost empty) VSI support
nodes in the tree, which could handle more queues after calling `ethtool -L`.
In such a case, the problem of not freeing some nodes might be masked.
>
> Thanks for your patience while we worked through the testing differences
> here today. Hope this helps and let me know if you have more questions.
>
>
> Jesse
>
Thanks again for reporting this bug, as it seems to have exposed a serious flaw
in the v1 of my fix.
As a next step, I will send the v2 of the series directly to IWL, where
(in patch #3) I will extend the algorithm for removing VSI nodes (to remove
all nodes related to a given VSI). This seems to help in my local testing.
Thanks,
Michal
* Re: [PATCH iwl-net 0/3] Fix XDP loading on machines with many CPUs
2025-05-08 14:29 ` Michal Kubiak
@ 2025-05-09 10:07 ` Michal Kubiak
0 siblings, 0 replies; 10+ messages in thread
From: Michal Kubiak @ 2025-05-09 10:07 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: intel-wired-lan, maciej.fijalkowski, aleksander.lobakin,
przemyslaw.kitszel, dawid.osuchowski, jacob.e.keller, netdev,
kernel-team
On Thu, May 08, 2025 at 04:29:17PM +0200, Michal Kubiak wrote:
[...]
> >
> > Thanks for your patience while we worked through the testing differences
> > here today. Hope this helps and let me know if you have more questions.
> >
> >
> > Jesse
> >
>
> Thanks again for reporting this bug, as it seems to have exposed a serious flaw
> in the v1 of my fix.
> As a next step, I will send the v2 of the series directly to IWL, where
> (in the patch #3) I will extend the algorithm for removing VSI nodes (to remove
> all nodes related to a given VSI). This seems to help in my local testing.
>
> Thanks,
> Michal
>
Hi Jesse,
I have just sent the v2 of the series, which seems to pass my tests.
If you have the opportunity, please verify that it also works in your
environment.
The link: https://lore.kernel.org/netdev/20250509094233.197245-1-michal.kubiak@intel.com/T/#me89fd25db3e9c9101f4030af2c7ce8662686f3a1
Thanks,
Michal