public inbox for linux-rdma@vger.kernel.org
* [PATCH v3] net/mlx5: Flag state up only after cmdif is ready
@ 2025-06-03  6:14 Chenguang Zhao
  2025-06-03 17:25 ` Moshe Shemesh
  2025-06-05  9:19 ` Paolo Abeni
  0 siblings, 2 replies; 4+ messages in thread
From: Chenguang Zhao @ 2025-06-03  6:14 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Moshe Shemesh
  Cc: Chenguang Zhao, netdev, linux-rdma

While the driver is reloading during the recovery flow, it must not accept
new commands until the command interface is up again. Otherwise we may hit
a NULL pointer dereference when accessing uninitialized command structures.

The issue can be reproduced by running the following two scripts at the
same time:

1) Use the following script to trigger a PCI error.

for((i=1;i<1000;i++));
do
echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
echo "pci reset test $i times"
done

2) Use the following script to read the link speed.

while true; do cat /sys/class/net/eth0/speed &> /dev/null; done

task: ffff885f42820fd0 ti: ffff88603f758000 task.ti: ffff88603f758000
RIP: 0010:[] [] dma_pool_alloc+0x1ab/0x290
RSP: 0018:ffff88603f75baf0 EFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff882f77d90c80 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000000080d0 RDI: ffff882f77d90d10
RBP: ffff88603f75bb20 R08: 0000000000019ba0 R09: ffff88017fc07c00
R10: ffffffffc0a9c384 R11: 0000000000000246 R12: ffff882f77d90d00
R13: 00000000000080d0 R14: ffff882f77d90d10 R15: ffff88340b6c5ea8
FS: 00007efce8330740(0000) GS:ffff885f4da00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000003454fc6000 CR4: 00000000003407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call trace:
 mlx5_alloc_cmd_msg+0xb4/0x2a0 [mlx5_core]
 mlx5_alloc_cmd_msg+0xd3/0x2a0 [mlx5_core]
 cmd_exec+0xcf/0x8a0 [mlx5_core]
 mlx5_cmd_exec+0x33/0x50 [mlx5_core]
 mlx5_core_access_reg+0xf1/0x170 [mlx5_core]
 mlx5_query_port_ptys+0x64/0x70 [mlx5_core]
 mlx5e_get_link_ksettings+0x5c/0x360 [mlx5_core]
 __ethtool_get_link_ksettings+0xa6/0x210
 speed_show+0x78/0xb0
 dev_attr_show+0x23/0x60
 sysfs_read_file+0x99/0x190
 vfs_read+0x9f/0x170
 SyS_read+0x7f/0xe0
 tracesys+0xe3/0xe8

Fixes: a80d1b68c8b7a0 ("net/mlx5: Break load_one into three stages")
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
---
v3:
 - The PCI error recovery path is mlx5_load_one ->
  mlx5_load_one_devl_locked -> mlx5_function_setup ->
  mlx5_function_enable -> mlx5_cmd_enable. mlx5_cmd_enable sets
  cmd->state to MLX5_CMDIF_STATE_DOWN, and since a failed PCI error
  recovery concerns the recovery of the entire device, I prefer to use
  MLX5_DEVICE_STATE_UP.

v2:
 https://lore.kernel.org/all/b8c300f8-bb3b-421f-81c5-f493984f922d@nvidia.com/ 

v1:
 https://lore.kernel.org/all/20250527013723.242599-1-zhaochenguang@kylinos.cn/
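
For readers following the ordering argument in the v3 note, here is a
minimal userspace sketch of the window this patch is meant to close. It is
not driver code: every toy_* name is hypothetical, a NULL cmd_pool merely
stands in for uninitialized command structures, and the model assumes that
a caller is turned away whenever the device is not flagged up.

```c
#include <stddef.h>

/* Toy stand-ins; all names here are hypothetical, not kernel definitions. */
enum toy_dev_state { TOY_DEV_INTERNAL_ERROR, TOY_DEV_UP };

enum toy_query_result {
	TOY_QUERY_OK,       /* command could be issued safely */
	TOY_QUERY_REJECTED, /* device not flagged up, caller turned away */
	TOY_QUERY_CRASH     /* would touch uninitialized command structures */
};

struct toy_dev {
	enum toy_dev_state state;
	int *cmd_pool; /* NULL models uninitialized command structures */
};

static int toy_pool_storage;

/* Models a concurrent reader, such as the sysfs speed query. */
static enum toy_query_result toy_query(const struct toy_dev *dev)
{
	if (dev->state != TOY_DEV_UP)
		return TOY_QUERY_REJECTED;
	if (dev->cmd_pool == NULL)
		return TOY_QUERY_CRASH;
	return TOY_QUERY_OK;
}

/* Models bringing up the command interface (assumed to set up the pool). */
static void toy_cmdif_bring_up(struct toy_dev *dev)
{
	dev->cmd_pool = &toy_pool_storage;
}

/* Old order: flag the device up first, then bring up the command interface. */
static enum toy_query_result toy_probe_old_order(void)
{
	struct toy_dev dev = { TOY_DEV_INTERNAL_ERROR, NULL };
	enum toy_query_result seen;

	dev.state = TOY_DEV_UP;
	seen = toy_query(&dev); /* reader slips in before cmdif is ready */
	toy_cmdif_bring_up(&dev);
	return seen;
}

/* New order: bring up the command interface first, then flag the device up. */
static enum toy_query_result toy_probe_new_order(void)
{
	struct toy_dev dev = { TOY_DEV_INTERNAL_ERROR, NULL };
	enum toy_query_result seen;

	toy_cmdif_bring_up(&dev);
	seen = toy_query(&dev); /* reader slips in before the state flips */
	dev.state = TOY_DEV_UP;
	return seen;
}
```

In this model toy_probe_old_order() returns TOY_QUERY_CRASH, while
toy_probe_new_order() returns TOY_QUERY_REJECTED: the reader that races
with the reload is refused instead of reaching the command structures.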
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 41e8660c819c..713f1f4f2b42 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1210,6 +1210,9 @@ static int mlx5_function_enable(struct mlx5_core_dev *dev, bool boot, u64 timeou
 	dev->caps.embedded_cpu = mlx5_read_embedded_cpu(dev);
 	mlx5_cmd_set_state(dev, MLX5_CMDIF_STATE_UP);
 
+	/* remove any previous indication of internal error */
+	dev->state = MLX5_DEVICE_STATE_UP;
+
 	err = mlx5_core_enable_hca(dev, 0);
 	if (err) {
 		mlx5_core_err(dev, "enable hca failed\n");
@@ -1602,8 +1605,6 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
 		mlx5_core_warn(dev, "interface is up, NOP\n");
 		goto out;
 	}
-	/* remove any previous indication of internal error */
-	dev->state = MLX5_DEVICE_STATE_UP;
 
 	if (recovery)
 		timeout = mlx5_tout_ms(dev, FW_PRE_INIT_ON_RECOVERY_TIMEOUT);
-- 
2.25.1



Thread overview: 4+ messages
2025-06-03  6:14 [PATCH v3] net/mlx5: Flag state up only after cmdif is ready Chenguang Zhao
2025-06-03 17:25 ` Moshe Shemesh
2025-06-05  9:19 ` Paolo Abeni
     [not found] <1l92ogj6wlz-1l96i9zg23c@nsmail7.0.0--kylin--1>
2025-06-05  8:14 ` Moshe Shemesh
